Mounting a Text Class Collection
From DLXS Documentation
Revision as of 13:49, 7 September 2007
Abstract
The Text Class consists primarily of monographic material (books and pamphlets), but material such as journals, especially when converted from print and not subject to ongoing work, and letters can also be included. Whether current publishing or historical, whether the work was composed/edited for electronic distribution or print, whether the texts have been fully encoded or are automatically generated from raw OCR and a MARC record, these works are all:
- extended text, typically of prose, verse, or drama, and (significantly) combinations of these;
- typically with a high degree of structure;
- and frequently drawn together in large groups or collections
Although materials in the proposed Reference Class may become a part of such a collection, we tend to think of members of the Text Class as having a less predictable structure and application. While the typical uses of these materials have much to do with the behaviors we apply to them (e.g., members of the Text Class are often read at length or analyzed for linguistic purposes; a member of the proposed Reference Class is more typically consulted for information), we focus here primarily on the great variability of their organization and the relatively large bodies of material that are assembled. These two factors converge to lead us to treat the organizational characteristics more generically. In an important way, this class more than any other serves as a base class from which new classes grow or can be defined. For more information about the DLPS Classes model, please see the DLPS Classes Overview (../index.html).
Description
Basic Characteristics
The Text Class can federate text collections for searching, but also allows access to each collection individually. Access minimally includes full text searching across collections or within a particular collection, bibliographic searching within document metadata, browsing by author or title, and creation of personal collections of texts.
General Characteristics
- Allows search and retrieval of electronic texts and portions thereof
- Allows searching across multiple collections simultaneously
- Allows searching of each collection independently
- Displays a text as HTML and/or as a page image, as appropriate
- Allows bookmarking of individual texts
- Requires minimal administrative data
- Uses a single data model and shared middleware for all collections in the system
- Permits access restrictions at the collection level
At the University of Michigan, the Text Class access system is administered by the Digital Library Production Service, Encoded Text Services. Encoded Text Services receives data in SGML or XML on a periodic basis from a variety of organizations on and off campus, and handles the process of putting the data online in a production-level environment.
Text Class provides no functionality for creating and managing electronic texts in SGML.
Typical Sources of Data
The following sources of data have influenced the process of defining the Text Class. Other sources are possible.
- Libraries: electronic texts and collections, both fully encoded and page images with OCR
- Publishers: both "born digital" and retrospectively converted texts
- Scholars/authors: electronic texts to support their research and/or teaching
Typical Applications of the Class
- Campus access
- Public access
Behaviors of the Text Class
- Cross-collection searching in any combination of collections
- Selection of collections by name or by attributes such as period covered, genre, etc.
- Collection-specific searching
- Simple, Boolean, Proximity, and Bibliography searching
- Searching within a user-selected text
- Ability to review and revise previous searches
- Browsing, either by a canned query embedded in a link, or with the "word wheel"
- Viewing of sections of a text or the full text in HTML (when available), page images (when available), or both (when applicable)
General Parameters
Electronic texts and electronic versions of previously printed monographs are available in SGML from a number of sources, encoded in a number of different DTDs. While the content of each text is unique, the structure of each text is similar and often encoded in generalizable ways. The Text Class federates collections at the point of access in order to best support the diversity of texts and encoding practices.
This diversity becomes evident when searching across databases is attempted. The Text Class supports cross-collection searching, and it does so without forcing all collections to be encoded at the same level. All collection-specific elements are available to the user for search and display when doing a collection-specific search. In a cross-collection search, all of the elements that the selected collections have in common are available; when a collection with a very low level of encoding is included, full-text searching may be the only option presented.
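The rule for cross-collection searching described above amounts to taking the intersection of what each selected collection can search. The following is only a minimal sketch of that idea, not the Text Class middleware: the function name, the collection names, and the region labels are all hypothetical and chosen purely for illustration.

```python
def common_regions(collections):
    """Return the searchable regions shared by every selected collection.

    `collections` maps a (hypothetical) collection name to the set of
    regions that collection's encoding makes searchable.
    """
    region_sets = [set(regions) for regions in collections.values()]
    if not region_sets:
        return set()
    return set.intersection(*region_sets)


# Hypothetical example: one richly encoded collection, one raw-OCR collection.
selected = {
    "fully_encoded": {"full text", "title", "author", "stanza"},
    "ocr_only": {"full text"},
}

print(sorted(common_regions(selected)))  # ['full text']
```

Including the lightly encoded collection reduces the shared set to full text alone, which matches the case where full-text searching is the only option presented.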
Dynamics
The data structure and behaviors of the Text Class best serve the functional requirements of collections of electronic texts grouped by genre, subject, or some other unifying theme. Individual texts can be served alone, but the methods and behaviors of the class will treat the text as a collection containing one item only.
Formal Data Definition
The Text Class relies on a single SGML Document Type Definition (DTD) to deliver all collections in the class. The SGML data in the Text Class is converted to the Text Class delivery DTD from the specific encoding DTD for the particular text or collection, often using Extensible Stylesheet Language Transformations (XSLT). Two general principles apply. First, common elements share a common nomenclature: paragraphs are always <P>, never a mix of <P>, <PARA>, and <PARAGRAPH>. Second, nested elements of the same name are numbered: for example, a note element can contain other note elements, so these would be <NOTE1> and <NOTE2> respectively. The Text Class delivery DTD is based roughly on the TEILite DTD as applied by the TEI in Libraries Draft Guidelines for Best Encoding Practices (http://www.indiana.edu/~letrs/tei/). The SGML is then indexed with XPAT and made searchable on the Web by the Text Class middleware.
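The two naming principles can be illustrated with a short script. This is a sketch under stated assumptions rather than the actual conversion process: it uses Python's standard XML parser in place of XSLT, it assumes the input is well-formed XML rather than SGML, and the synonym table, the choice to number every NOTE by its nesting depth, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of encoding-DTD names to one common name.
SYNONYMS = {"PARA": "P", "PARAGRAPH": "P"}
# Elements that may nest inside themselves and are numbered by depth (assumed).
NUMBERED = {"NOTE"}


def normalize(elem, depths=None):
    """Rename synonyms and number same-name nested elements, in place."""
    # Copy the depth counters so sibling branches do not affect each other.
    depths = dict(depths or {})
    name = SYNONYMS.get(elem.tag.upper(), elem.tag.upper())
    if name in NUMBERED:
        depths[name] = depths.get(name, 0) + 1
        elem.tag = f"{name}{depths[name]}"
    else:
        elem.tag = name
    for child in elem:
        normalize(child, depths)
    return elem


source = (
    "<TEXT><PARA>Body<NOTE>Outer<NOTE>Inner</NOTE></NOTE></PARA>"
    "<PARAGRAPH>More</PARAGRAPH></TEXT>"
)
root = normalize(ET.fromstring(source))
print(ET.tostring(root, encoding="unicode"))
# <TEXT><P>Body<NOTE1>Outer<NOTE2>Inner</NOTE2></NOTE1></P><P>More</P></TEXT>
```

Both paragraph variants come out as <P>, and the outer and inner notes become <NOTE1> and <NOTE2>, as in the example in the text.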
Page images, when available, are either 600 dpi TIFFs delivered through the Text Class pageviewer mechanisms as GIFs or PDFs, or continuous tone images delivered via Image Class.
Representative Resource(s)
Making of America (http://moa.umdl.umich.edu), Twentieth Century American Poetry (http://qoud.lib.umich.edu/a/ampo20/).
The decision process for inclusion of content in the Text Class is roughly based on the following:
- The text is a monograph or journal issue available in SGML or XML.
- Each text has a unique ID.
- The campus and/or world community would benefit from access to the information.
These are, at best, general guidelines for decision making based on the current state of the Text Class implementation at the University of Michigan.
Examples of Text Class Implementations and Practices
Replaces the old "Representative Resources" section. Ideally this will eventually be expanded to showcase sites you think do something really well or in an interesting way.
Working with Text Class Markup
FIGURE Resolution
Working with the Text Class DTD
Working with Text Class Data and Directories
workshop doc: storing texts and page images; set up directories for XPAT indexing; etc.
Data Preparation and Conversion
normalizing and validation workshop doc would go here
Working with Fabricated Regions
Customizing Text Class
This could provide other topics, but I think it would be a good place for a discussion of the pieces of text class that are customizable ranked by level of difficulty. Then you could link to the places in the doc that discuss specific methods for customizing different bits.
Building the Index
Mounting the Collection Online
Text Class Collection to Web from workshop doc would go in here
Working with Page Image Access Mechanisms in Text Class
Working with the User Interface
Specifying Individual Collection Characteristics
Could we re-label this topic with a friendlier title? Perhaps "Customizing Text Class UI Layout" or something.