Mounting a Text Class Collection
From DLXS Documentation
Revision as of 14:35, 7 September 2007
Abstract
The Text Class consists primarily of monographic material (books and pamphlets), but material such as journals, especially when converted from print and not subject to ongoing work, and letters can also be included. Whether current publishing or historical, whether the work was composed/edited for electronic distribution or print, whether the texts have been fully encoded or are automatically generated from raw OCR and a MARC record, these works are all:
- extended text, typically of prose, verse, or drama, and (significantly) combinations of these;
- typically with a high degree of structure;
- and frequently drawn together in large groups or collections
Although materials in the proposed Reference Class may become a part of such a collection, we tend to think of members of the Text Class as having a less predictable structure and application. While the typical uses of these materials have much to do with the behaviors we apply to them (e.g., members of the Text Class are often read at length or analyzed for linguistic purposes; a member of the proposed Reference Class is more typically consulted for information), we focus here primarily on the great variability of their organization and the relatively large bodies of material that are assembled. These two factors converge to lead us to treat the organizational characteristics more generically. In an important way, this class more than any other serves as a base class from which new classes grow or can be defined. For more information about the DLPS Classes model, please see the DLPS Classes Overview.
Description
Basic Characteristics
The Text Class can federate text collections for searching, but also allows access to each collection individually. Access minimally includes full text searching across collections or within a particular collection, bibliographic searching within document metadata, browsing by author or title, and creation of personal collections of texts.
General Characteristics
- Allows search and retrieval of electronic texts and portions thereof
- Allows searching across multiple collections simultaneously
- Allows searching of each collection independently
- Displays a text as HTML and/or as a page image, as appropriate
- Allows bookmarking of individual texts
- Requires minimal administrative data
- Uses a single data model and shared middleware for all collections in the system
- Permits access restrictions at the collection level
At the University of Michigan, the Text Class access system is administered by the Digital Library Production Service, Encoded Text Services. Encoded Text Services receives data in SGML or XML on a periodic basis from a variety of organizations on and off campus, and handles the process of putting the data online in a production-level environment.
Text Class provides no functionality for creating and managing electronic texts in SGML or XML.
Typical Sources of Data
The following sources of data have influenced the process defining the Text Class. Other sources are possible.
- Libraries: electronic texts and collections, both fully encoded and page images with OCR
- Publishers: both "born digital" and retrospectively converted texts
- Scholars/authors: electronic texts to support their research and/or teaching
Typical Applications of the Class
- Campus access
- Public access
Behaviors of the Text Class
- Cross-collection searching in any combination of collections
- Selection of collections by name or by attributes such as period covered, genre, etc.
- Collection-specific searching
- Simple, Boolean, Proximity, and Bibliography searching
- Searching within a user-selected text
- Ability to review and revise previous searches
- Browsing, either by a limited set of elements in the HEADER (author, title, subject), or with the "word wheel"
- Viewing of sections of a text or the full text in HTML (when available), page images (when available), or both (when applicable)
General Parameters
Electronic texts and electronic versions of previously printed monographs are available in SGML or XML from a number of sources, encoded in a number of different DTDs or schemas. While the content of each text is unique, the structure of each text is similar and often encoded in generalizable ways. The Text Class federates collections at the point of access in order to best support the diversity of texts and encoding practices.
This diversity becomes evident when searching across databases is attempted. It is a considerable achievement that the Text Class supports cross-collection searching, and that it does so without forcing all collections to be encoded at the same level. All collection-specific elements are available to the user for search and display when doing a collection-specific search. In the cross-collection search, all of the elements that the collections have in common are available; when a collection with a very low level of encoding is included, full-text searching may be the only option presented.
Dynamics
The data structure and behaviors of the Text Class best serve the functional requirements of collections of electronic texts grouped by genre, subject, or some other unifying theme. Individual texts can be served alone, but the methods and behaviors of the class will treat the text as a collection containing one item only.
Formal Data Definition
The Text Class relies on a single XML Document Type Definition (DTD) to deliver all collections in the class. The SGML or XML data in the Text Class is converted to the Text Class delivery DTD from the specific encoding DTD for the particular text or collection, often using Extensible Stylesheet Language Transformations (XSLT). The general principles are that there is a common nomenclature for common elements -- paragraphs are <P>, not <P> and <PARA> and <PARAGRAPH> -- and that nested elements of the same name are numbered -- for example, a note element can contain other note elements, so these would be <NOTE1> and <NOTE2> respectively. The Text Class delivery DTD is based roughly on the TEI Lite DTD as applied by the TEI in Libraries Draft Guidelines for Best Encoding Practices. The converted SGML or XML is then indexed with XPAT and made searchable on the Web by the Text Class middleware.
Page images, when available, are either 600 dpi TIFFs delivered through the Text Class pageviewer mechanisms as GIFs or PDFs, or continuous tone images delivered through Text Class or via Image Class to provide pan and zoom functionality.
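The two naming principles can be illustrated with a minimal Python sketch. It is not the DLXS conversion (which the text says is often done with XSLT); the function name, the alias table, and the sample document are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical alias table: variant paragraph names collapse to <P>.
ALIASES = {"PARA": "P", "PARAGRAPH": "P"}

def normalize(elem, numbered, depth=None):
    """Apply a common name to aliased elements and number nested
    elements whose names are in `numbered` by nesting depth."""
    depth = dict(depth or {})          # copy so siblings do not affect each other
    name = ALIASES.get(elem.tag, elem.tag)
    if name in numbered:
        level = depth.get(name, 0) + 1
        depth[name] = level
        name = f"{name}{level}"        # NOTE -> NOTE1, nested NOTE -> NOTE2
    elem.tag = name
    for child in elem:
        normalize(child, numbered, depth)
    return elem

doc = ET.fromstring(
    "<TEXT><NOTE><PARA>a</PARA><NOTE><P>b</P></NOTE></NOTE><NOTE/></TEXT>"
)
normalize(doc, {"NOTE"})
tags = [e.tag for e in doc.iter()]
```

The depth counter is copied at each level, so a second top-level note is again NOTE1 rather than NOTE3.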
Examples of Text Class Implementations and Practices
Making of America, Twentieth Century American Poetry.
The decision process for inclusion of content in the Text Class is roughly based on the following:
- The text is a monograph or journal issue available in SGML or XML.
- Each text has a unique ID.
- The campus and/or world community would benefit from access to the information.
These are, at best, general guidelines for decision making based on the current state of the Text Class implementation at the University of Michigan.
Working with Text Class Markup
FIGURE Resolution
This document describes the mechanisms and programs used by DLXS for accessing and displaying images that are defined by the FIGURE element in TextClass document markup.
- [#figureelement The FIGURE Element]
- [#generalresolution FIGURE Resolution in General]
- [#idresolvermech The IdResolver Resolution Mechanism]
- [#filesystemmech The Filesystem Resolution Mechanism]
- [#defaultbehavior Default Behavior and Custom Configuration]
The FIGURE Element
The FIGURE element is used in TextClass markup to encode the occurrence of a figure in a text document. The FIGURE element's ENTITY attribute carries the ID that the middleware resolves to an image file on disk or to an image under management by ImageClass.
If the document is a TEI Level 1 (page image document viewable in Pageviewer) the FIGURE tag supplies additional data about the bitonal page image beyond that supplied by the PB tag. The additional data might be a second continuous tone scan of the entire page or a continuous tone scan of a plate within the page. The middleware makes the contone images viewable under ImageClass via links built in the full OCR text of the document (in cases where that OCR is displayable) or via [pageviewer.html#pageviewFig additional links in Pageviewer]. For higher TEI levels, the FIGURE tag typically calls out the occurrence of an inline figure, the image for which is usually on disk but may also be retrievable from ImageClass through the IdResolver mechanism described in the [pageviewer.html#DLXS_IDResolver IdResolver section] of the Pageviewer documentation.
The balance of this document describes how the FIGURE element is resolved into an image via two mechanisms: Filesystem resolution and IdResolver resolution. Also described is how the resolution can be affected by writing subclasses of TextClass.
FIGURE Resolution in General
The resolution mechanism is summarized in the following diagram:
<FIGURE ENTITY="abc"> → transform "abc" to key → lookup key → URL or file system path
The document markup is parsed and the ENTITY attribute value of a given FIGURE tag is extracted. The attribute value is transformed into a key/path suitable for lookup via the DLXS IdResolver or by looking on the disk. If IdResolver is used, the corresponding ImageClass URL is returned. If the filesystem is used, the path to the file in the web space is determined. This URL/path becomes the value of a new attribute which is added to the FIGURE tag and passed along for eventual processing by the XSL stylesheet (principally text.components.xsl). The XSL stylesheet typically transforms the FIGURE tag together with its attributes into an HTML img tag, possibly wrapped with an HTML anchor (a) tag, if a popup window or link into ImageClass is required to view the image of the figure.
Actually, depending on the situation, the middleware adds two or three new attributes to the FIGURE tag for XSL processing. They are FIGTYPE, HREF_1 and HREF_2. Collection-specific configuration referred to below is discussed in the [#defaultbehavior configuration section] later in this document.
- FIGTYPE carries one of two values:
- INLINE when the configuration is for figures that appear inline with the text. In this case, only the HREF_1 attribute will appear.
- THUMB when the configuration is for figures that have thumbnail images that appear inline with the text. In this case, both of the HREF_* attributes will appear.
- HREF_1 is the URL or web space path to the inline image when FIGTYPE=INLINE or to a thumbnail image of the full sized image when FIGTYPE=THUMB.
- HREF_2 is the URL or web space path to the full sized version of the figure image when FIGTYPE=THUMB.
Note: The debug=resolver URL parameter can be added to the end of the URL to see the action of the resolver as it operates on the ENTITY attribute of the FIGURE tag.
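The attribute rules above can be summarized in a short Python sketch. The function name and the example URLs are hypothetical; the actual middleware is Perl and its internals are not shown in this document.

```python
def figure_attributes(uses_thumbs, href_1, href_2=None):
    """Return the synthetic attributes added to a FIGURE tag.

    uses_thumbs: True when the collection is configured for thumbnails.
    href_1: inline image (INLINE) or thumbnail image (THUMB).
    href_2: full sized image, required only for THUMB.
    """
    if uses_thumbs:
        if href_2 is None:
            raise ValueError("THUMB figures need both HREF_1 and HREF_2")
        return {"FIGTYPE": "THUMB", "HREF_1": href_1, "HREF_2": href_2}
    return {"FIGTYPE": "INLINE", "HREF_1": href_1}

# Hypothetical paths, for illustration only.
inline = figure_attributes(False, "/m/moa/images/fig1.gif")
thumb = figure_attributes(True, "/m/moa/images/fig1-th.gif",
                          "/m/moa/images/fig1.gif")
```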
The IdResolver Resolution Mechanism
If the configuration indicates figure images are "external", i.e. that the figure images are managed in ImageClass or by a 3rd-party host, the IdResolver is used to resolve the ID to an ImageClass or 3rd-party URL. The IdResolver mechanism is described in the [pageviewer.html#DLXS_IDResolver IdResolver section] of the Pageviewer documentation.
The Filesystem Resolution Mechanism
If the configuration indicates figure images are not "external", the figure images should be found as files in the web space. The middleware constructs a default path to the image in the web space as /webdir/images/ENTITY.extension where webdir is the collmgr value for the collection and extension comes from a list (.gif, .jpg, etc.). The middleware tests for file existence in the web space for each extension until a hit occurs. This allows files of several different formats to coexist in the web space.
The DLXS directory convention is to store these image files in DLXSROOT/img/c/coll and make a symbolic link to that directory in DLXSROOT/web/c/coll called images.
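The extension-probing step can be sketched in Python as follows. This is an illustration under stated assumptions, not the middleware code: the function name, the `web_root` parameter, and the default extension list (the text gives only ".gif, .jpg, etc.") are all hypothetical.

```python
import os

# Assumed list; the source names only ".gif, .jpg, etc."
DEFAULT_EXTENSIONS = (".gif", ".jpg")

def resolve_figure_path(web_root, webdir, entity, extensions=DEFAULT_EXTENSIONS):
    """Return the web space path /webdir/images/ENTITY.extension for the
    first extension whose file exists under web_root, or None."""
    for ext in extensions:
        web_path = f"/{webdir.strip('/')}/images/{entity}{ext}"
        disk_path = os.path.join(web_root, *web_path.strip("/").split("/"))
        if os.path.isfile(disk_path):
            return web_path            # first hit wins
    return None
```

Because the first matching extension wins, the order of the list decides which file is served when the same ENTITY exists in more than one format.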
Default Behavior and Custom Configuration
The default for the figure resolution mechanism assumes all figures are inline, on disk, without corresponding thumbnail images. Within this constraint it is possible to change the way the path to the disk file is generated to derive a number of naming conventions based on the bare entity attribute value. This is described below.
Modifying the described configuration of the figure resolution mechanism is accomplished by writing a subclass of the TextClass package for each affected collection. The size of the methods that need to be written is small, typically just a line or two of code.
Following is a synopsis of the methods provided for subclassing. Please consult the code in DLXSROOT/cgi/t/text/TextClass.pm and its subclasses for more detail.
- ExternalFigFullImage by default returns 0 indicating that figure images are on disk. Override this and return 1 to indicate images should be resolved into ImageClass or 3rd-party URLs by IdResolver.
- ExternalFigThumb by default returns 0 indicating that thumbnail images of figures are on disk. Override this and return 1 to indicate thumbnail images should be resolved into ImageClass or 3rd-party URLs by IdResolver.
- CollUsesThumbs by default returns 0. Override this and return 1 to indicate that figures should be rendered as inline thumbnails linked to a full sized version of the image. Note that thumbnails and their full sized counterparts can reside under ImageClass or be accessed directly from the disk.
- GenerateExternalId returns the default format for the IdResolver ID (key) as ic-collid-idno-ENTITY-[th|rec] where th is for the thumb and rec is for the full sized image. ENTITY is the value of the ENTITY attribute in the <FIGURE> tag in the XML. By overriding this method, different ID schemes can be implemented.
- GenerateFilesysPathElement by default simply returns the element for concatenation as described above. In the case where CollUsesThumbs is true and thumbnails and full sized images are on disk, this method can be overridden to supply different names for the thumbnails and full sized image files.
- FigureUriHook by default returns the URL passed in. For collections where the URL of the image needs to be wrapped in a javascript call this method can be subclassed.
- SyntheticAttrsHook can be overridden to add attributes to the FIGURE tag for specialized processing by the XSL stylesheets as part of collection-specific customization.
- FilesysIdResolver can be overridden to supply a custom list of file extensions to check if the base class default list is not inclusive enough. In the subclass implementation, the list is passed as the optional last parameter to an invocation of the base class method, i.e. as SUPER::FilesysIdResolver.
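The override pattern described in this list can be sketched in Python. The real classes are Perl packages (see DLXSROOT/cgi/t/text/TextClass.pm), so every class name, method name, and signature below is a hypothetical stand-in chosen to mirror the descriptions above; only the ID format ic-collid-idno-ENTITY-[th|rec] comes from the text.

```python
class TextClassSketch:
    """Hypothetical stand-in for the base class defaults."""
    DEFAULT_EXTENSIONS = (".gif", ".jpg")   # assumed default list

    def external_fig_full_image(self):
        return 0    # figure images are on disk

    def coll_uses_thumbs(self):
        return 0    # figures are inline, no thumbnails

    def generate_external_id(self, collid, idno, entity, kind):
        # kind is "th" (thumbnail) or "rec" (full sized image)
        return f"ic-{collid}-{idno}-{entity}-{kind}"

    def filesys_id_resolver(self, entity, extensions=None):
        exts = self.DEFAULT_EXTENSIONS if extensions is None else extensions
        return [f"{entity}{ext}" for ext in exts]


class ExampleCollection(TextClassSketch):
    """A collection-specific subclass: each override is a line or two."""

    def coll_uses_thumbs(self):
        return 1    # render inline thumbnails linked to full sized images

    def filesys_id_resolver(self, entity, extensions=None):
        # Pass a custom list as the optional last parameter to the base method.
        return super().filesys_id_resolver(entity, (".png", ".tif"))
```

The sketch returns candidate file names rather than probing the disk, to keep the example self-contained.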
Working with the Text Class DTD
Working with Text Class Data and Directories
Setting up directories
You will need to identify directories where you plan to store your source files, your converted and concatenated Text Class XML file, your index file (approximately 75% of the size of your SGML source), your "region" files and other information such as data dictionaries, and files you use to prepare your data. We recommend you use the following structure:
- Store specialized scripts for your collection and its Makefile in $DLXSROOT/bin/c/collid/, where $DLXSROOT is the "tree" where you install all DLXS components, c is the first letter of the name of the collection you are indexing, and collid is the collection ID of the collection you are indexing. For example, if your collection ID is "moa" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/m/moa/, e.g., /l1/bin/m/moa/Makefile. See directory conventions for more information.
- Store your source texts and any DTDs, doctype, and files for preparing your data in $DLXSROOT/prep/c/collid/. Unlike the contents of other directories, everything in prep should be ultimately expendable in the production environment.
- Store the finalized, concatenated Text Class XML file for your text collection in $DLXSROOT/obj/c/collid/ , e.g., /l1/obj/m/moa/moa.xml.
- Store index, region, data dictionary, and init files in $DLXSROOT/idx/c/collid/, e.g., /l1/idx/m/moa/moa.idx. See the XPAT documentation for more on these types of files.
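The directory convention above is mechanical enough to express as a short Python sketch. The function name is hypothetical; the paths follow the examples in the list ("/l1" and "moa").

```python
from pathlib import PurePosixPath

def collection_dirs(dlxsroot, collid):
    """Map each top-level area to $DLXSROOT/<area>/<c>/<collid>,
    where <c> is the first letter of the collection ID."""
    root = PurePosixPath(dlxsroot)
    first = collid[0]
    return {area: root / area / first / collid
            for area in ("bin", "prep", "obj", "idx")}

dirs = collection_dirs("/l1", "moa")
makefile = str(dirs["bin"] / "Makefile")
xml_file = str(dirs["obj"] / "moa.xml")
index_file = str(dirs["idx"] / "moa.idx")
```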
The files that are located in $DLXSROOT/bin/s/sampletc_utf8 and $DLXSROOT/prep/s/sampletc_utf8 should be copied into your collection directories and used to index your collection. The following files may need to be edited so that the #! line points to the location of perl on your system:
- $DLXSROOT/bin/t/text/isolat128bit.pl
- $DLXSROOT/bin/t/text/output.dd.frag.pl
- $DLXSROOT/bin/t/text/inc.extra.dd.pl
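If you prefer to script the #! change, a minimal Python sketch follows. The function name and the example interpreter paths are assumptions; check the actual first line of each script before changing it.

```python
def set_shebang(script_text, perl_path):
    """Point the #! line at perl_path, keeping interpreter flags such as -w.
    Text without a #! first line is returned unchanged."""
    lines = script_text.split("\n")
    if lines and lines[0].startswith("#!"):
        parts = lines[0][2:].split(None, 1)      # [old interpreter, flags]
        flags = f" {parts[1]}" if len(parts) > 1 else ""
        lines[0] = f"#!{perl_path}{flags}"
    return "\n".join(lines)

# Hypothetical old and new perl locations.
updated = set_shebang("#!/l/local/bin/perl -w\nprint 1;\n", "/usr/bin/perl")
```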
The following files will need to be edited to reflect your collection names and paths:
- $DLXSROOT/bin/s/sampletc_utf8/Makefile
- $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.blank.dd
- $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.extra.srch
- $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.inp
Data Preparation and Conversion
normalizing and validation workshop doc would go here
Working with Fabricated Regions
When you use XPAT in combination with xmlrgn or sgmlrgn and a DTD, or multirgn and a tags file, you are identifying the elements and attributes in the DTD or tags file as "regions," containers of content rather like fields in a database. These separate regions are built into the regions file (collid.rgn) and are identified in the data dictionary (collid.dd). This is what is happening when you are running sgmlrgn and/or xmlrgn.
However, sometimes the things you want to identify collectively aren't so handily identified as elements in the DTD. For example, suppose you want to search within specific features of a book, such as a chapter, that can occur at different hierarchical levels in different volumes. Also, the element isn't even called CHAPTER; it's a numbered division with a type attribute telling you that it's a chapter.
In order to fabricate a region containing all the divisions in books that are chapters, for example, you can first find all the regions with a query:
(region DIV1 incl (region "DIV1-T" incl "type=chapter"))+ (region DIV2 incl (region "DIV2-T" incl "type=chapter"))
You could do a more complex search based on treating attributes as regions instead of text strings, which is functionally the same:
(region DIV1 incl (region "DIV1-T" incl (region "A-TYPE" incl chapter)))+ (region DIV2 incl (region "DIV2-T" incl (region "A-TYPE" incl chapter)))
Finally, once you have a query that produces the results you want, create a file for export, export it, and sync it:
{exportfile "$DLXSROOT/idx/c/collid/chapter.rgn"}
export
~sync "chapter"
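When chapters can occur at more than two division levels, the query and the export commands can be generated rather than typed. The following Python sketch only assembles strings in the form shown above; the function names are hypothetical, and it does not run XPAT.

```python
def chapter_query(levels, div_type="chapter"):
    """Union (+) of one clause per DIV level, in the form used above."""
    clauses = [
        f'(region DIV{n} incl (region "DIV{n}-T" incl "type={div_type}"))'
        for n in levels
    ]
    return "+ ".join(clauses)

def export_commands(idx_dir, region_name):
    """The three XPAT commands that export and sync a fabricated region."""
    return [
        '{exportfile "%s/%s.rgn"}' % (idx_dir, region_name),
        "export",
        '~sync "%s"' % region_name,
    ]

query = chapter_query([1, 2])
commands = export_commands("$DLXSROOT/idx/c/collid", "chapter")
```

Lines like these would go in the collection's extra.srch file, with the query placed before the export commands.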
Why fabricate regions? Why not just put these queries in the map file and call them chapters? While you could, it's probably worth your time to build these succinctly-named and precompiled regions; query errors are more easily identified in the index building than in the CGI, and XPAT searches can be simpler and quicker for terms within the prebuilt regions.
Fabricated regions within Text Class can be found in the extra.srch file for the sample collection at $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.extra.srch . As with any other elements used in the interface for a given collection, fabricated regions used must also be represented in the collmgr entry and the map file for that collection.
Fabricated regions required in Text Class
- main
- mainheader
- maintitle
- div1head
Fabricated regions commonly found in Text Class
- mainauthor
- maindate
- page (for collections with page images)
- id (for collections with a number of different IDNO elements)
- divxhead (for collections with divisions nested below DIV1)
Customizing Text Class
This could provide other topics, but I think it would be a good place for a discussion of the pieces of text class that are customizable ranked by level of difficulty. Then you could link to the places in the doc that discuss specific methods for customizing different bits.
Building the Index
Mounting the Collection Online
Text Class Collection to Web from workshop doc would go in here
Working with Page Image Access Mechanisms in Text Class
Working with the User Interface
Specifying Individual Collection Characteristics
Could we re-label this topic with a friendlier title? Perhaps "Customizing Text Class UI Layout" or something.