[[DLXS Wiki|Main Page]] > [[Mounting Collections: Class-specific Steps]] > Mounting a Text Class Collection
==[[Text Class Overview]]==

= Abstract =

The Text Class consists primarily of monographic material (books and pamphlets), but material such as journals (especially when converted from print and not subject to ongoing work) and letters can also be included. Whether the publishing is current or historical, whether the work was composed or edited for electronic distribution or for print, and whether the texts have been fully encoded or are automatically generated from raw OCR and a MARC record, these works are all:

# extended text, typically of prose, verse, or drama, and (significantly) combinations of these;
# typically with a high degree of structure;
# and frequently drawn together in large groups or collections.

Although materials in the proposed Reference Class may become a part of such a collection, we tend to think of members of the Text Class as having a less predictable structure and application. While the ''typical'' uses of these materials have much to do with the behaviors we apply to them (e.g., members of the Text Class are often read at length or analyzed for linguistic purposes, while a member of the proposed Reference Class is more typically consulted for information), we focus here primarily on the great variability of their organization and the relatively large bodies of material that are assembled. These two factors converge to lead us to treat the organizational characteristics more generically. In an important way, this class more than any other serves as a base class from which new classes grow or can be defined. For more information about the DLPS Classes model, please see the [[DLPS Classes Overview]].
= Description =

== Basic Characteristics ==

The Text Class can federate text collections for searching, but also allows access to each collection individually. Access minimally includes full-text searching across collections or within a particular collection, bibliographic searching within document metadata, browsing by author or title, and creation of personal collections of texts.
== General Characteristics ==

* Allows search and retrieval of electronic texts and portions thereof
* Allows searching across multiple collections simultaneously
* Allows searching of each collection independently
* Displays a text as HTML and/or as a page image, as appropriate
* Allows bookmarking of individual texts
* Requires minimal administrative data
* Uses a single data model and shared middleware for all collections in the system
* Permits access restrictions at the collection level
At the University of Michigan, the Text Class access system is administered by the Digital Library Production Service, Encoded Text Services. Encoded Text Services receives data in SGML or XML on a periodic basis from a variety of organizations on and off campus, and handles the process of putting the data online in a production-level environment.

Text Class provides no functionality for creating and managing electronic texts in SGML or XML.
=== Typical Sources of Data ===

The following sources of data have influenced the process of defining the Text Class. Other sources are possible.

* Libraries: electronic texts and collections, both fully encoded and page images with OCR
* Publishers: both "born digital" and retrospectively converted texts
* Scholars/authors: electronic texts to support their research and/or teaching
=== Typical Applications of the Class ===

* Campus access
* Public access
=== Behaviors of the Text Class ===

* Cross-collection searching in any combination of collections
* Selection of collections by name or by attributes such as period covered, genre, etc.
* Collection-specific searching
* Simple, Boolean, Proximity, and Bibliography searching
* Searching within a user-selected text
* Ability to review and revise previous searches
* Browsing, either by a limited set of elements in the HEADER (author, title, subject) or with the "word wheel"
* Viewing of sections of a text or the full text in HTML (when available), page images (when available), or both (when applicable)
== General Parameters ==

Electronic texts and electronic versions of previously printed monographs are available in SGML or XML from a number of sources, encoded in a number of different DTDs or schemas. While the content of each text is unique, the structure of each text is similar and often encoded in generalizable ways. The Text Class federates collections at the point of access in order to best support the diversity of texts and encoding practices.

This diversity becomes evident when searching across databases is attempted. It is a considerable achievement that the Text Class supports cross-collection searching, and that it does so without forcing all collections to be encoded at the same level. All collection-specific elements are available to the user for search and display when doing a collection-specific search. In a cross-collection search, all of the elements that the collections have in common are available; when a collection with a very low level of encoding is included, full-text searching may be the only option presented.
== Dynamics ==

The data structure and behaviors of the Text Class best serve the functional requirements of collections of electronic texts grouped by genre, subject, or some other unifying theme. Individual texts can be served alone, but the methods and behaviors of the class will treat the text as a collection containing one item only.
== Formal Data Definition ==

The Text Class relies on a single XML Document Type Definition (DTD) to deliver all collections in the class. The SGML or XML data in the Text Class is converted to the Text Class delivery DTD from the specific encoding DTD for the particular text or collection, often using XSL Transformations (XSLT). The general principles are that there is a common nomenclature for common elements -- paragraphs are <P>, not <P> and <PARA> and <PARAGRAPH> -- and that nested elements of the same name are numbered -- for example, a note element can contain other note elements, so these would be <NOTE1> and <NOTE2> respectively. The Text Class delivery DTD is based roughly on the TEILite DTD as applied by the [http://www.indiana.edu/~letrs/tei/ TEI in Libraries Draft Guidelines for Best Encoding Practices]. The SGML is then indexed with XPAT and made searchable on the Web by the Text Class middleware.

Page images, when available, are either 600 dpi TIFFs delivered through the Text Class pageviewer mechanisms as GIFs or PDFs, or continuous tone images delivered through Text Class or via Image Class to provide pan and zoom functionality.
==[[Examples of Text Class Implementations and Practices]]==
[http://moa.umdl.umich.edu Making of America], [http://quod.lib.umich.edu/a/ampo20/ Twentieth Century American Poetry].
==[[Working with Text Class Markup]]==
==[[FIGURE Resolution]]==

This section describes the mechanisms and programs used by DLXS for accessing and displaying images that are defined by the FIGURE element in TextClass document markup.
* [[#The FIGURE Element|The FIGURE Element]]
* [[#FIGURE Resolution in General|FIGURE Resolution in General]]
** [[#The IdResolver Resolution Mechanism|The IdResolver Resolution Mechanism]]
** [[#The Filesystem Resolution Mechanism|The Filesystem Resolution Mechanism]]
* [[#Default Behavior and Custom Configuration|Default Behavior and Custom Configuration]]
= The FIGURE Element =

The FIGURE element is used in TextClass markup to encode the occurrence of a figure in a text document. The FIGURE element's ENTITY attribute carries the ID that the middleware resolves to an image file on disk or to an image under management by ImageClass.

If the document is TEI Level 1 (a page image document viewable in Pageviewer), the FIGURE tag supplies additional data about the bitonal page image beyond that supplied by the PB tag. The additional data might be a second continuous tone scan of the entire page, or a continuous tone scan of a plate within the page. The middleware makes the contone images viewable under ImageClass via links built into the full OCR text of the document (in cases where that OCR is displayable) or via [pageviewer.html#pageviewFig additional links in Pageviewer]. For higher TEI levels, the FIGURE tag typically calls out the occurrence of an inline figure, the image for which is usually on disk but may also be retrievable from ImageClass through the IdResolver mechanism described in the [pageviewer.html#DLXS_IDResolver IdResolver section] of the Pageviewer documentation.

The balance of this document describes how the FIGURE element is resolved into an image via two mechanisms: ''filesystem resolution'' and ''IdResolver resolution''. Also described is how the resolution can be affected by writing subclasses of TextClass.
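As a minimal sketch of what this looks like in practice, a figure occurrence in a paragraph might be encoded as follows (the ENTITY value and surrounding text here are invented for illustration):

```xml
<!-- Hypothetical fragment: the ENTITY value is an invented ID that the
     middleware resolves to an image file on disk or an ImageClass URL. -->
<P>The frontispiece portrait
  <FIGURE ENTITY="abc1234.0001.001-00000002"/>
  faces the title page.</P>
```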
= FIGURE Resolution in General =

The resolution mechanism is summarized in the following diagram:

 <FIGURE ENTITY="abc"> → transform "abc" to key → look up key → URL or filesystem path

The document markup is parsed and the ENTITY attribute value of a given FIGURE tag is extracted. The attribute value is transformed into a key/path suitable for lookup via the DLXS IdResolver or by looking on the disk. If IdResolver is used, the corresponding ImageClass URL is returned. If the filesystem is used, the path to the file in the web space is determined. This URL/path becomes the value of a new attribute which is added to the FIGURE tag and passed along for eventual processing by the XSL stylesheet (principally <tt>text.components.xsl</tt>). The XSL stylesheet typically transforms the FIGURE tag together with its attributes into an HTML '''img''' tag, possibly wrapped with an HTML anchor ('''a''') tag if a popup window or link into ImageClass is required to view the image of the figure.

Depending on the situation, the middleware adds two or three new attributes to the FIGURE tag for XSL processing: '''FIGTYPE''', '''HREF_1''', and '''HREF_2'''. The collection-specific configuration referred to below is discussed in the [[#Default Behavior and Custom Configuration|configuration section]] later in this document.

# '''FIGTYPE''' carries one of two values:
#* '''INLINE''' when the configuration is for figures that appear inline with the text. In this case, only the '''HREF_1''' attribute will appear.
#* '''THUMB''' when the configuration is for figures that have thumbnail images that appear inline with the text. In this case, both of the '''HREF_*''' attributes will appear.
# '''HREF_1''' is the URL or web space path to the inline image when '''FIGTYPE=INLINE''', or to a thumbnail image of the full-sized image when '''FIGTYPE=THUMB'''.
# '''HREF_2''' is the URL or web space path to the full-sized version of the figure image when '''FIGTYPE=THUMB'''.

'''Note:''' The <tt>debug=resolver</tt> parameter can be added to the end of the URL to see the action of the resolver as it operates on the ENTITY attribute of the FIGURE tag.
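The stylesheet's role can be sketched roughly as follows. These two templates are illustrative only, not the actual rules in <tt>text.components.xsl</tt>; they simply show how the synthetic FIGTYPE/HREF_* attributes described above could drive the HTML output:

```xml
<!-- Illustrative sketch, not the real text.components.xsl rules. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Inline figure: emit a plain img pointing at HREF_1 -->
  <xsl:template match="FIGURE[@FIGTYPE='INLINE']">
    <img src="{@HREF_1}" alt="figure"/>
  </xsl:template>
  <!-- Thumbnail figure: inline thumb (HREF_1) linked to full size (HREF_2) -->
  <xsl:template match="FIGURE[@FIGTYPE='THUMB']">
    <a href="{@HREF_2}"><img src="{@HREF_1}" alt="figure thumbnail"/></a>
  </xsl:template>
</xsl:stylesheet>
```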
=== The IdResolver Resolution Mechanism ===

If the configuration indicates figure images are "external", i.e., that the figure images are managed in ImageClass or by a third-party host, the IdResolver is used to resolve the ID to an ImageClass or third-party URL. The IdResolver mechanism is described in the [pageviewer.html#DLXS_IDResolver IdResolver section] of the Pageviewer documentation.

=== The Filesystem Resolution Mechanism ===

If the configuration indicates figure images are not "external", the figure images should be found as files in the web space. The middleware constructs a default path to the image in the web space as <tt>/webdir/images/''ENTITY''.extension</tt>, where <tt>webdir</tt> is the collmgr value for the collection and <tt>extension</tt> comes from a list (.gif, .jpg, etc.). The middleware tests for file existence in the web space for each extension until a hit occurs. This allows files of several different formats to coexist in the web space.

The DLXS directory convention is to store these image files in <tt>DLXSROOT/img/c/coll</tt> and make a symbolic link to that directory in <tt>DLXSROOT/web/c/coll</tt> called <tt>images</tt>.
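The symlink convention above can be set up with ordinary shell commands. This sketch uses a scratch tree and a hypothetical collection "coll"; substitute your real DLXSROOT (e.g. /l1) and collection ID:

```shell
# Sketch of the directory convention: image files live under img/,
# and the web space sees them through an "images" symlink.
DLXSROOT=/tmp/dlxs-demo           # scratch tree; substitute your real root
mkdir -p "$DLXSROOT/img/c/coll" "$DLXSROOT/web/c/coll"
# expose the image directory in the web space as "images"
ln -sfn "$DLXSROOT/img/c/coll" "$DLXSROOT/web/c/coll/images"
```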
= Default Behavior and Custom Configuration =

The default figure resolution mechanism assumes all figures are inline, on disk, and without corresponding thumbnail images. Within this constraint it is possible to change the way the path to the disk file is generated, to accommodate a number of naming conventions based on the bare ENTITY attribute value. This is described below.

Modifying the configuration of the figure resolution mechanism is accomplished by writing a subclass of the TextClass package for each affected collection. The methods that need to be written are small, typically just a line or two of code.

Following is a synopsis of the methods provided for subclassing. Please consult the code in <tt>DLXSROOT/cgi/t/text/TextClass.pm</tt> and its subclasses for more detail.

* '''ExternalFigFullImage''' by default returns 0, indicating that figure images are on disk. Override this to return 1 to indicate images should be resolved into ImageClass or third-party URLs by IdResolver.
* '''ExternalFigThumb''' by default returns 0, indicating that thumbnail images of figures are on disk. Override this to return 1 to indicate thumbnail images should be resolved into ImageClass or third-party URLs by IdResolver.
* '''CollUsesThumbs''' by default returns 0. Override this to return 1 to indicate that figures should be rendered as inline thumbnails linked to a full-sized version of the image. Note that thumbnails and their full-sized counterparts can reside under ImageClass or be accessed directly from the disk.
* '''GenerateExternalId''' returns the default format for the IdResolver ID (key) as <tt>ic-''collid-idno''-''ENTITY''-[th|rec]</tt>, where '''th''' is for the thumbnail and '''rec''' is for the full-sized image. '''''ENTITY''''' is the value of the '''ENTITY''' attribute in the <FIGURE> tag in the XML. By overriding this method, different ID schemes can be implemented.
* '''GenerateFilesysPathElement''' by default simply returns the element for concatenation as described above. In the case where '''CollUsesThumbs''' is true and thumbnails and full-sized images are on disk, this method can be overridden to supply different names for the thumbnail and full-sized image files.
* '''FigureUriHook''' by default returns the URL passed in. For collections where the URL of the image needs to be wrapped in a JavaScript call, this method can be subclassed.
* '''SyntheticAttrsHook''' can be subclassed to add additional attributes to the FIGURE tag for specialized processing by the XSL stylesheets for collection-specific customization.
* '''FilesysIdResolver''' can be subclassed to supply a custom list of file extensions to check if the base class default list is not inclusive enough. In the subclass implementation, the list is passed as the optional last parameter to an invocation of the base class method, i.e., as <tt>SUPER::FilesysIdResolver</tt>.
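A hypothetical subclass might look roughly like this. The package name and collection are invented, and the exact signatures should be checked against <tt>TextClass.pm</tt>; this only sketches the one-or-two-line overrides described above:

```perl
# Hypothetical sketch: a collection whose figures live under ImageClass
# and are rendered as inline thumbnails. Method names follow the synopsis
# above; consult DLXSROOT/cgi/t/text/TextClass.pm for real signatures.
package MyCollTC;
use strict;
use base qw(TextClass);

# Resolve full-sized figure images via IdResolver instead of the filesystem
sub ExternalFigFullImage { return 1; }

# Render figures as inline thumbnails linked to the full-sized image
sub CollUsesThumbs { return 1; }

1;
```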
===[[Working with the Text Class DTD]]===

= Introduction =

This document lays out the elements and attributes required in all Text Class texts. It is assumed that any texts to be converted to Text Class already validate against another DTD for encoding monographic materials, such as TEI Lite, that represents corresponding structures (chapters, pages, etc.). Because of the extremely lax content modelling (almost every element is defined to permit ANY of the named elements), the Text Class DTD is useful only to ensure that the nomenclatures have been changed appropriately.
= Basic Requirements =

The following elements and attributes are required:

* DLPSTEXTCLASS
* HEADER
* EDITORIALDECL with an N attribute according to the [http://www.indiana.edu/~letrs/tei/ TEI in Libraries recommendations] for encoding levels
* IDNO -- must be unique for each text in a collection, and should not contain spaces
* TEXT
* BODY
* Nested numbered DIVs, with at least one DIV1
** Each DIV must have TYPE and NODE attributes
** All content, such as pagebreaks, should be inside DIVs, not between them, if at all possible
* Nested numbered forms of elements that can contain themselves
** For example, NOTE, Q, and HI become NOTE1, Q1, and HI1, which contain NOTE2, etc.
* NOTE1 and PTR must have N and ID attributes
** The default value for N is *; IDs will be automatically assigned for elements without them
See the following example document:

 <DLPSTEXTCLASS>
  <HEADER>
   <FILEDESC>
    <TITLESTMT>
     <TITLE TYPE="245">Some Book</TITLE>
     <AUTHOR>Some Author</AUTHOR>
    </TITLESTMT>
    <EXTENT>1 600dpi TIFF G4 page image</EXTENT>
    <PUBLICATIONSTMT>
     <PUBLISHER>Humanities Text Initiative</PUBLISHER>
     <PUBPLACE>University of Michigan</PUBPLACE>
     <IDNO TYPE="dlps">acw8433.1-01.001</IDNO>
     <AVAILABILITY>
      <P>These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically. Please contact umdl-info@umich.edu for more information.</P>
     </AVAILABILITY>
    </PUBLICATIONSTMT>
    <SOURCEDESC>
     <BIBL>
      <TITLE TYPE="main">Some Book</TITLE>
      <AUTHOR>Some Author</AUTHOR>
      <PUBLISHER>D. Appleton and Company.</PUBLISHER>
      <PUBPLACE>New York,</PUBPLACE>
      <DATE>Apr 3, 1869</DATE>
     </BIBL>
    </SOURCEDESC>
   </FILEDESC>
   <ENCODINGDESC>
    <PROJECTDESC>
     <P>Revised with script moajrnltotextclass.pl on 2001-11-21.</P>
    </PROJECTDESC>
    <EDITORIALDECL N="1">
     <P>This electronic text file was created by Optical Character Recognition (OCR). No corrections have been made to the OCR-ed text and no editing has been done to the content of the original document. Encoding has been done through an automated process using the recommendations for Level 1 of the TEI in Libraries Guidelines. Digital page images are linked to the text file.</P>
    </EDITORIALDECL>
   </ENCODINGDESC>
   <PROFILEDESC>
    <TEXTCLASS>
     <KEYWORDS>
      <TERM>Equations, Abelian.</TERM>
     </KEYWORDS>
    </TEXTCLASS>
   </PROFILEDESC>
  </HEADER>
  <TEXT>
   <BODY>
    <DIV1 NODE="acw8433.1-01.001:1">
     <P>
      <PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="874" N=""/>asdfqwerwerwer <NOTE1 N="*" ID="note0001">see asdfasdft</NOTE1></P>
    </DIV1>
   </BODY>
  </TEXT>
 </DLPSTEXTCLASS>
==Build the Collection-Specific Text Class DTD==

Before indexing your collection, you will need to create a collection-specific Text Class DTD. Because the class DTD supports any element having any of the declared attributes (for convenience of DTD creation), indexing "as is" with XPATu will cause your index to be unnecessarily large. This may also cause problems for XML validation tools. You can create your own collection-specific version of the Text Class DTD by running the following command:

 egrep -i "<\!ELEMENT" $DLXSROOT/misc/sgml/textclass.xml.dtd > textclass.stripped.xml.dtd

We'll use the "make dtd" command from the Makefile to determine which attributes are used in your collection and build a custom DTD by concatenating them with $DLXSROOT/misc/xml/textclass.stripped.xml.dtd. Using the "make validate" command will then validate your collection against the new DTD. If the individual texts validated before, they should validate as a concatenated collection now.

 cd $DLXSROOT/bin/w/workshoptc
 make dtd
 make validate
==[[Working with Text Class Data and Directories]]==
= Setting up directories =

You will need to identify directories where you plan to store your source files, your converted and concatenated Text Class XML file, your index file (approximately 75% of the size of your SGML source), your "region" files and other information such as data dictionaries, and the files you use to prepare your data. We recommend you use the following structure:

* Store specialized scripts for your collection and its Makefile in <tt>$DLXSROOT/bin/c/collid/</tt>, where <tt>$DLXSROOT</tt> is the "tree" where you install all DLXS components, <tt>c</tt> is the first letter of the name of the collection you are indexing, and <tt>collid</tt> is the collection ID of the collection you are indexing. For example, if your collection ID is "moa" and your DLXSROOT is "/l1", you will place the Makefile in <tt>/l1/bin/m/moa/</tt>, e.g., <tt>/l1/bin/m/moa/Makefile</tt>. See [[directory conventions]] for more information.
* Store your source texts and any DTDs, doctype files, and files for preparing your data in <tt>$DLXSROOT/prep/c/collid/</tt>. Unlike the contents of the other directories, everything in prep should be ultimately expendable in the production environment.
* Store the finalized, concatenated Text Class XML file for your text collection in <tt>$DLXSROOT/obj/c/collid/</tt>, e.g., <tt>/l1/obj/m/moa/moa.xml</tt>.
* Store index, region, data dictionary, and init files in <tt>$DLXSROOT/idx/c/collid/</tt>, e.g., <tt>/l1/idx/m/moa/moa.idx</tt>. See the [[XPAT documentation]] for more on these types of files.
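The recommended skeleton above can be created in one pass. This sketch uses a scratch tree and the hypothetical collection ID "moa" from the example; substitute your real DLXSROOT (e.g. /l1) and collection ID:

```shell
# Sketch: create the recommended directory skeleton for a collection.
DLXSROOT=/tmp/dlxs-skel                 # scratch tree; substitute e.g. /l1
collid=moa
c=$(printf '%s' "$collid" | cut -c1)    # first letter of the collection ID
for d in bin prep obj idx; do
  mkdir -p "$DLXSROOT/$d/$c/$collid"
done
```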
The files located in <tt>$DLXSROOT/bin/s/sampletc_utf8</tt> and <tt>$DLXSROOT/prep/s/sampletc_utf8</tt> should be copied into your collection directories and used to index your collection. The following files may need to be edited so that the #! line points to your location of perl:

* $DLXSROOT/bin/t/text/isolat128bit.pl
* $DLXSROOT/bin/t/text/output.dd.frag.pl
* $DLXSROOT/bin/t/text/inc.extra.dd.pl
The following files will need to be edited to reflect your collection names and paths:

* $DLXSROOT/bin/s/sampletc_utf8/Makefile
* $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.blank.dd
* $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.extra.srch
* $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.inp
==[[Data Preparation and Conversion]]==

== Data Conversion: Unicode, XML, and Normalization ==
To make the most of Text Class and Finding Aids Class in DLXS Release 13, you will want to convert or otherwise handle the character entities, numeric entities, or Latin 1 8-bit characters that have been the staples of SGML (and XML, despite the default encoding of UTF-8) for so long. This is separate from the conversion of Text Class materials to the Text Class DTD; even with finding aids that are already in XML, you will probably need to do some testing of character encodings, conversion of these encodings to UTF-8, normalization, and conversion of SGML to XML (strange but true).

----
=== '''Determining the Character Encodings Present in Your Data''' ===

There are a number of possibilities you may encounter:

# Plain ASCII (aka the Basic Latin block)
# Character entity references (ISO and otherwise)
# Numeric character references (decimal and/or hexadecimal)
# Latin 1 characters
# UTF-8 characters

You may very well find a mixture of 1, 2, 3, and 4, or even 2, 3, and 5, in the wild, simply because many encoders are not clear on what they should be doing with special characters. One hopes you will not encounter a document with a mixture of Latin 1 and UTF-8 characters, although it's possible that misidentified files could end up concatenated together and create such a mess.
There are a number of tools you can use to identify what you have before you.

; findentities.pl
: A perl script written by Phil that is part of the DLXS package; it prints the names and frequencies of the entities (CERs and NCRs) it encounters. Fairly quick, regardless of the size of the file. Can be run on more than one file at once, which is handy if you have a batch of texts.
; xpatutf8check
: Another perl script written by Phil; it exists to answer the question, "Will xpatu index this?" It will report the line number of the first non-UTF-8 character it encounters when it has failed. It runs very quickly, so it's great as a first step in checking your material, but it is not authoritative enough to identify all of the problems you may have.
; jhove
: The JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. Can be slow checking large documents, but very informative. Available at http://hul.harvard.edu/jhove/ and invoked with
 jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml
; utf8chars
: Yet another perl script written by Phil; it identifies the characters used in a document and the Unicode blocks to which they belong. It assumes your document is UTF-8 and will report each instance (by line number) where a non-UTF-8 character is encountered. Because it is identifying and counting each character in a document, it is rather slow, but very useful. Runs on one file at a time and prints to standard out, but can be invoked through a <tt>foreach</tt> to check many files in one command.
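If the local scripts are not at hand, plain iconv can serve as a rough stand-in for the validity check (a generic alternative, not part of DLXS): it exits non-zero when its input is not valid UTF-8.

```shell
# Rough UTF-8 validity check with iconv; sample files are fabricated here.
printf 'caf\xc3\xa9\n' > /tmp/good.txt   # UTF-8 e-acute
printf 'caf\xe9\n'     > /tmp/bad.txt    # Latin 1 e-acute, invalid as UTF-8
iconv -f UTF-8 -t UTF-8 /tmp/good.txt > /dev/null && echo "good.txt is UTF-8"
iconv -f UTF-8 -t UTF-8 /tmp/bad.txt  > /dev/null 2>&1 || echo "bad.txt is not UTF-8"
```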
----

=== '''Converting Those Character Encodings to UTF-8''' ===

If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.

# First, if you have Latin 1 characters like â, run iconv, part of the GNU C library, to convert files from one encoding to another.
 iconv -f iso88591 -t utf8 oldfile > newfile
# Next, convert character entity references like &acirc; using isocer2utf8, a perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it has been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.
 /l1/bin/t/text/isocer2utf8 oldfile > newfile
# Finally, if you have numeric character references like &#226; or &#xE2;, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters.
 /l1/bin/t/text/ncr2utf8 oldfile > newfile

This would be a good point to run findentities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.
----

=== '''Test Driving the Tools''' ===

In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8: findaid1.xml, findaid2.xml, text1.xml, and text2.sgm. Copy these to your own directory -- they are completely expendable and won't serve a purpose in tomorrow's Text Class implementation. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them.
First, we'll look at which character or numeric entities, if any, are used in these documents.

 foreach file (findaid*)
 echo $file
 $DLXSROOT/bin/t/text/findEntities.pl $file
 end

 foreach file (text*)
 echo $file
 $DLXSROOT/bin/t/text/findEntities.pl $file
 end
We have some CERs and NCRs to deal with, aside from the five XML-approved entities (&amp;, &gt;, &lt;, &apos;, and &quot;). So we know we'll need both <tt>isocer2utf8</tt> and <tt>ncr2utf8</tt>. Next, we'll see what characters we have (Latin 1? UTF-8? Something else?). We'll run through all three tools, just for the sake of completeness, in order of speediness and terseness.

 foreach file (findaid*)
 echo $file
 xpatutf8check $file
 end

 foreach file (text*)
 echo $file
 xpatutf8check $file
 end
We now know that both of the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. jhove will tell us a bit more about our materials. You'll note we don't need to echo the filename, as that's part of the jhove report. You'll also notice jhove is not so fast.

 foreach file (findaid*)
 jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
 end

 foreach file (text*)
 jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
 end
- | | + | |
- | So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is. Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all. We're not doing a foreach this time, but we wouldn't need to echo the filename either, as it is again part of what the tool reports.
| + | |
- | | + | |
- | <blockquote>
| + | |
- | | + | |
- | utf8chars findaid1.xml
| + | |
- | | + | |
- | utf8chars text1.xml
| + | |
- | | + | |
- | </blockquote>
| + | |
- | | + | |
- | We can see the exact problem with findaid1.xml -- there's an 8-bit Latin 1 e acute before Boekeloo on line 37. We also can see all the UTF-8 characters in text1.xml -- this is the sort of information that is useful when time comes to map characters and encodings in the xpatu data dictionary.
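What these checkers do is conceptually simple. The following is a minimal stand-in (not the DLXS tools themselves) for the kind of report xpatutf8check and utf8chars produce: flag any line that is not valid UTF-8, and list the non-ASCII characters on lines that are.

```python
# Minimal sketch of an xpatutf8check/utf8chars-style report (not the DLXS
# tools): lines that fail to decode as UTF-8 are problems; non-ASCII
# characters on valid lines are listed for later data-dictionary mapping.
def check_utf8(data: bytes):
    problems, non_ascii = [], []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError as e:
            problems.append((lineno, e.reason))
            continue
        non_ascii.extend((lineno, ch) for ch in text if ord(ch) > 127)
    return problems, non_ascii

# A raw Latin 1 e acute (byte 0xE9) is not a valid UTF-8 sequence,
# so a file like findaid1.xml would be flagged on that line:
probs, chars = check_utf8(b"plain ascii\n\xe9 Boekeloo\n")
```

A file that is already UTF-8 (or plain ASCII) produces no problems; its accented characters simply show up in the second list.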
Now that we know which items need which character treatments, we'll convert them. text1.xml is completely fine, so we'll leave it as is. findaid1.xml has the one Latin 1 character, so we'll use iconv to convert it to UTF-8. It had no entities of any kind, so we'll be done with it after this step.

<blockquote>

iconv -f iso88591 -t utf8 findaid1.xml > findaid1.xml.utf

</blockquote>

Next, findaid2.xml had numeric character references. It is fine and can be indexed as-is, but users would need to search for the hexadecimal string in the midst of words ( &#xE9; for é, for example). So, we'll use ncr2utf8 to convert the entities into the characters. WARNING! &#x26; is the ampersand (as is &#38;) -- if you convert these to the character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Don't get carried away!
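The warning above is easy to honor in code. This is a hypothetical sketch of an ncr2utf8-style pass (not the DLXS tool itself): decimal and hexadecimal numeric character references are expanded to the characters they name, except those that would produce a bare ampersand.

```python
import re

# Hypothetical sketch of an ncr2utf8-style conversion (not the DLXS tool):
# expand decimal and hex numeric character references, but leave
# &#38;/&#x26; (the ampersand) alone so the output stays valid XML.
NCR = re.compile(r"&#(x[0-9a-fA-F]+|[0-9]+);")

def expand_ncrs(text: str) -> str:
    def repl(m):
        body = m.group(1)
        code = int(body[1:], 16) if body.startswith("x") else int(body)
        if code == 0x26:          # keep the ampersand as a reference
            return m.group(0)
        return chr(code)
    return NCR.sub(repl, text)
```

So `caf&#xE9;` becomes `café`, while `&#38;` and `&#x26;` pass through untouched.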
<blockquote>

ncr2utf8 findaid2.xml > findaid2.xml.utf

</blockquote>

Finally, text2.sgm has ISO character entity references (from Latin 1, Greek, and Publishing) that need to be converted to UTF-8 with isocer2utf8.

<blockquote>

isocer2utf8 text2.sgm > text2.sgm.utf

</blockquote>

Note that the ampersand CER was not processed. This is perfectly correct.

=== ''More Documentation'' ===

* [http://www.dlxs.org/docs/13/class/unicode.html DLXS Unicode Data Preparation and Online Presentation Issues]
* [http://www.zvon.org/other/charSearch/PHP/search.php Zvon Character Search]
* [http://www.alanwood.net/unicode/ Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications]

----
=== '''Normalization and Converting SGML to XML''' ===

Many of you may be in a position where you'll want to convert your SGML files to XML. Many of you will be fortunate enough to have files already in XML -- say, finding aids in EAD 2002. However, these will have to be normalized too, to avoid problems with xpatu and xmlrgn down the road by ensuring that all the attributes are in the same order as specified in the DTD. Because of known but uncorrected problems in the normalization tools, you will end up with SGML and will need to convert that to XML.

Because the file we want to work with is now UTF-8, we need to set some environment variables to let the tools from the sp package know this is UTF-8. It doesn't matter that you've set your PuTTY window to UTF-8; if you are using osx, osgmlnorm, or onsgmls, you must set your environment properly.

<blockquote>

setenv SP_CHARSET_FIXED YES

setenv SP_ENCODING utf-8

</blockquote>

First we normalize, invoking a declaration to handle the non-SGML UTF-8 characters without claiming that the material itself is XML.

<blockquote>

osgmlnorm $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.utf > text2.sgm.norm

</blockquote>

Now I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and it's fine. Now to convert our SGML to XML using osx.

<blockquote>

osx -x no-nl-in-tag -x empty -E 500 -f errors $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.norm > text2.xml

</blockquote>

Again I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and again it's fine.
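After a conversion like this, a quick well-formedness check is cheap insurance. A minimal sketch using Python's standard library (any XML parser would serve the same purpose; note this checks well-formedness only, not validity against the DTD):

```python
import xml.etree.ElementTree as ET

# Quick well-formedness check for a converted file: try to parse it and
# report the first syntax error, if any. Well-formedness only -- this
# does not validate the document against its DTD.
def well_formed(xml_text: str):
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as e:
        return False, str(e)
```

A bare ampersand -- the very thing the entity-conversion warning guards against -- is one of the errors such a check will catch.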
Just for fun, we'll normalize the files already in XML, to show that they get changed from XML back to SGML against your will -- which is why the osx step is needed afterward.

<blockquote>

osgmlnorm $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp text1.xml > text1.xml.norm

</blockquote><blockquote>

osx -x no-nl-in-tag -x empty -E 5000 -f error $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp text1.xml.norm > text1.xml.norm.xml

</blockquote>

=== ''More Documentation'' ===

* [http://www.dlxs.org/docs/13/class/unicode.html DLXS Unicode Data Preparation and Online Presentation Issues]
* [http://linuxcommand.org/man_pages/osgmlnorm1.html osgmlnorm Documentation]
* [http://linuxcommand.org/man_pages/osx1.html osx Documentation]
* [http://www.jclark.com/sp/ James Clark's Original sp Documentation]

----
=== '''Using Unconverted Collections with Current Middleware''' ===

Recognizing that there may be situations when you do not wish to migrate a collection to XML and UTF-8 immediately, there are mechanisms built into Release 13 to allow the middleware to handle SGML-style empty elements (aka singletons), Latin 1 characters, and character entity references. How do you make this happen? In DlpsUtils.pm, there is a subroutine called Sgml2XmlFilter that has a hard-coded list of empty elements (<PB>, <LB>, <CAESURA>, etc.) that are converted upon discovery to XML-style (<PB/>, <LB/>, <CAESURA/>, etc.), and a feature that converts Latin 1 (ISO-8859-1) characters to UTF-8. This subroutine comes into play if the locale field in collmgr is not set to en_US.UTF-8 (locale used to be optional but is now required if you are using UTF-8 and xpatu). In order to declare your entities, you need to put a file called <tt>entitiesdoctype.chnk</tt> in the web directory for your collection, declaring the entities like so:

<blockquote>

<!DOCTYPE Top [
<!ENTITY Dstrok "&#x0110;">
<!ENTITY Sacute "&#x015A;">
<!ENTITY Scaron "&#352;">
<!ENTITY Ubreve "&#x016C;">
<!ENTITY Zdot "&#x017B;">
]>

</blockquote>

That being the case, why would anyone ever bother to go through the trouble of converting their material? First, the value of having UTF-8 is apparent if you have material that used more than one entity set (and even the lowliest collections have both an e acute and an em-dash in them somewhere). Now that &mdash; is one character that can be mapped to a space in the data dictionary like other punctuation, phrases that were obscured in searches now turn up, and characters that we used to flatten to their base forms can be displayed. Second, this facility comes at a cost: all of the material returned needs to be run through this filter, which takes time. In a results list, the lag is negligible, but in larger sections of text, it could be noticeable. Finally, some confusion might arise when a user cuts and pastes material he received as a result and cannot retrieve it again, because the results and input are UTF-8 (which is the encoding of the search form) but the material being searched is not.
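The two transformations described above are straightforward to picture. This is an illustrative sketch of what an Sgml2XmlFilter-style pass does (it is not the actual Perl in DlpsUtils.pm): rewrite known SGML singletons as XML empty elements, and re-encode Latin 1 input as Unicode text.

```python
import re

# Illustrative sketch (not the actual DlpsUtils.pm code) of the two
# on-the-fly transformations described above: rewrite known SGML
# singletons as XML empty elements, and decode Latin 1 to Unicode.
SINGLETONS = ("PB", "LB", "CAESURA")   # a hard-coded list, per the text

def sgml2xml(raw: bytes) -> str:
    text = raw.decode("iso-8859-1")    # Latin 1 in, Unicode text out
    pattern = re.compile(r"<(%s)(\s[^>]*)?>" % "|".join(SINGLETONS))
    return pattern.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2) or ""), text)
```

Tags that are already XML-style (<PB/>) do not match the pattern and pass through unchanged.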
==[[Working with Fabricated Regions]]==
When you use XPAT in combination with xmlrgn or sgmlrgn and a DTD, or multirgn and a tags file, you are identifying the elements and attributes in the DTD or tags file as "regions," containers of content rather like fields in a database. These regions are built into the regions file (collid.rgn) and are identified in the data dictionary (collid.dd). This is what is happening when you run sgmlrgn and/or xmlrgn.

However, sometimes the things you want to identify collectively aren't so handily identified as elements in the DTD. For example, suppose you want to search within specific features of a book, such as a chapter, that can occur at different hierarchical levels in different volumes. Also, the element isn't even called CHAPTER; it's a numbered division with a type attribute telling you that it's a chapter.

In order to fabricate a region containing all the divisions in books that are chapters, for example, you can first find all the regions with a query:

 (region DIV1 incl (region "DIV1-T" incl "type=chapter"))+ (region DIV2 incl (region "DIV2-T" incl "type=chapter"))

You could do a more complex search based on treating attributes as regions instead of text strings, which is functionally the same:

 (region DIV1 incl (region "DIV1-T" incl (region "A-TYPE" incl chapter)))+ (region DIV2 incl (region "DIV2-T" incl (region "A-TYPE" incl chapter)))

Finally, once you have a query that produces the results you want, create a file for export, export it, and sync it:

 {exportfile "$DLXSROOT/idx/c/collid/chapter.rgn"}
 export
 ~sync "chapter"

Why fabricate regions? Why not just put these queries in the map file and call them chapters? While you could, it's probably worth your time to build these succinctly named and precompiled regions; query errors are more easily identified during index building than in the CGI, and XPAT searches can be simpler and quicker for terms within the prebuilt regions.

Fabricated regions within Text Class can be found in the extra.srch file for the sample collection at <tt>$DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.extra.srch</tt>. As with any other elements used in the interface for a given collection, fabricated regions must also be represented in the collmgr entry and the map file for that collection.
= Fabricated regions required in Text Class =

* main
* mainheader
* maintitle
* div1head

= Fabricated regions commonly found in Text Class =

* mainauthor
* maindate
* page (for collections with page images)
* id (for collections with a number of different IDNO elements)
* divxhead (for collections nested below DIV1)
==[[Customizing Text Class]]==
This could provide other topics, but I think it would be a good place for a discussion of the pieces of Text Class that are customizable, ranked by level of difficulty. Then you could link to the places in the doc that discuss specific methods for customizing different bits.
==[[Building the Index]]==
After you have followed all the steps to set up your directories and prepare your files, as found in the [[Text Class preparation documentation]], indexing the collection is fairly straightforward. To create an index for use with the Text Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" structures based on a combination of elements (for example, defining what the "main entry" is, without adding a <MAINENTRY> tag around the appropriate <AUTHOR> or <TITLE> element). The following commands can be used to make the index, alone or in combination.

# Ensure that your collection XML is valid by using the <tt>make validate</tt> command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile
# '''make singledd''' indexes words for texts that have been concatenated into one large file for a collection. This is the recommended process, as a data dictionary built from a single concatenated file is faster for searching and more reliable than one built using multi-file system indexing. Use the <tt>make singledd</tt> command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile
# '''make xml''' indexes the XML structure by reading the DTD, and validates as it indexes. It is slower than multiregion indexing (see [[XPAT documentation]] for more information) for this reason. However, it is necessary for collections that have nested elements of the same name (even when separated by an intervening element, such as a <P> within a <NOTE1> that is itself within a <P>). Use the <tt>make xml</tt> command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile
# '''make post''' builds and indexes fabricated regions based on the XPAT queries stored in the $DLXSROOT/prep/c/collid/collid.extra.srch file. Because every collection is different, this file will need to be adapted after you have determined what you want to use as a "poem" for text (e.g., perhaps every DIV1 TYPE="sonnet" and DIV2 TYPE="poem" in the collection) and how many levels of division heads you have in your collection (e.g., at least one text is nested to DIV4, so you'll need to fabricate up to div4head). If the extra.srch file references elements not used in your text collection, you will see errors like <tt>Error found: <Error>syntax error before: ")</Error></tt> when you use the <tt>make post</tt> command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. Remove unnecessary lines.

You have now built indexes and region files for your collection. You can test that things are properly indexed by issuing the command <tt>xpatu $DLXSROOT/idx/c/collid/collid.dd</tt> and doing searches, such as for a common word like <tt>the</tt> or an element that should appear such as <tt>region "main"</tt> or <tt>region "HEADER"</tt>. Strategically, it is good to test this from a directory other than the one you indexed in, to ensure that relative or absolute paths are resolving appropriately.
Everything is now set up to build the XPATu index. The Makefile in the bin directory contains the commands necessary to build the index, and can be executed easily.

<blockquote>

make dd

</blockquote><blockquote>

cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.blank.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xpatbldu -m 256m -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd

</blockquote><blockquote>

make xml

</blockquote><blockquote>

cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xmlrgn -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/misc/sgml/xml.dcl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.inp /l1/workshop/sooty/dlxs/obj/w/workshoptc/workshoptc.xml
/l/local/bin/xmlrgn:/l1/workshop/sooty/dlxs/misc/sgml/xml.dcl:1:W: SGML declaration was not implied
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd

</blockquote><blockquote>

make post

</blockquote><blockquote>

cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
touch /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.init
/l/local/bin/xpatu -q /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd < /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.srch | /l1/workshop/sooty/dlxs/bin/t/text/output.dd.frag.pl /l1/workshop/sooty/dlxs/idx/w/workshoptc/ > /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd
/l1/workshop/sooty/dlxs/bin/t/text/inc.extra.dd.pl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd

</blockquote>

<span><font color="#FF0000">Sorting and browse building require that you have only '''one''' maintitle, mainauthor, and maindate per text, so that you have one value on which to sort. Your extra.srch files may need to be changed in order to be more specific. If you do not change them, some sort operations will give you a <tt>sortkey</tt> assertion failure.</font></span>

<span><font color="#FF0000">Some examples of more specific searches in your extra.srch are provided below. The first relies on identifying metadata that has been specified through the use of attributes; the second merely chooses the first occurrence as an indication that it is the "main" value.</font></span>

 (((region TITLE incl "type=main") within region TITLESTMT) within region SOURCEDESC);
 {exportfile "/l1/idx/e/example/maintitle.rgn"}; export; ~sync "maintitle";

 (((region AUTHOR within (region "<TITLESTMT".."</AUTHOR>")) within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC));
 {exportfile "/l1/idx/e/example/mainauthor.rgn"}; export; ~sync "mainauthor";
==[[Mounting the Collection Online]]==
Text Class Collection to Web from workshop doc would go in here
==[[Working with Page Image Access Mechanisms in Text Class]]==
This document describes the mechanisms and programs used by DLXS for accessing and viewing images of pages that correspond to pages in Text Class documents (and possibly other classes). It also explains the particular metadata requirements that exist for this functionality to be possible and shows a sample pageview.dat file. The pageview.dat mechanisms are now deprecated, but still useful for importing information into the Pageview table (see the instructions below in Populating the Pageview Table). Also discussed is the mechanism for linking to ImageClass to view Pageviewer images in more detail.

This document covers:

* General Information
** Itemviewer Image Conversion
* The Page Break (PB) Element and Page Image Metadata
* Populating the Pageview Table
* Creating pageview.dat Files
* Linking from Pageviewer to ImageClass
= General Information =

For collections where the middleware delivers page images rather than (or in addition to) the text content of the pages, the main mechanism for viewing the pages is the pageviewer-idx CGI program. In order to link from the XML text to the corresponding image, this CGI expects that page images are stored in directories based on (1) the DLXSROOT value, (2) the object directory recorded in the collection manager, and (3) the unique identifier assigned to the XML text and stored in the IDNO element, and that there are page break elements in the document referencing the images. In addition, there must be a Pageview table in the dlxs metadata database containing a row for each page image, with columns listing the image file name, its sequence in the XML text, the page number (if any) printed on the page, the OCR confidence value (if available), and a three-letter code for any special features of the page (the default value for no special feature is UNS; see below for more information).

<span class="command">pageviewer-idx</span> connects to the database and retrieves the name and location of the page image file from the Pageview table. <span class="command">pageviewer-idx</span> then decides how to deliver the page. If the stored file format is different from the requested format as recorded in the collection manager (e.g., stored as TIFF and requested as GIF), a separate program, tif2web, is started to convert the image on the fly. ''For more information about how pageviewer-idx does its work in deciding how to deliver the page image, see Itemviewer Image Conversion.''
= The Page Break Element and Page Image Metadata =

The PB tag in the XML data, representing a page break, has this form in Text Class:

 <PB REF="00000009.tif" SEQ="00000009" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii"/>

The attributes are:

* REF: the file name of the page image.
* SEQ: the sequence number of the page in the series, from start to finish, of all the pages in the document.
* RES: the resolution of the page image.
* FMT: the file format of the page image.
* FTR: the feature of the page, given as a three-letter code. Possible values are listed below.
* CNF: the confidence value of the OCR for the page, given by the OCR software.
* N: the page number, not as a sequence, but rather the number as printed on the page (e.g., 3, 96, ix, etc.). This may be left blank, but the attribute cannot be omitted.

The information in this PB tag allows the Text Class middleware to create a URL to call the pageviewer-idx program with the parameters necessary to retrieve and display the corresponding page image. pageviewer-idx uses the Pageview table of the dlxs metadata database to do so.
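As a sketch of what the middleware has to extract from each page break, the attributes of a PB tag can be pulled into a dictionary (a hypothetical helper, not DLXS code):

```python
import re

# Hypothetical helper (not DLXS code): pull the attributes of a Text
# Class <PB> tag into a dict so the pieces (REF, SEQ, FTR, ...) are
# easy to use when building the pageviewer-idx URL.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def parse_pb(tag: str) -> dict:
    return dict(ATTR.findall(tag))

pb = parse_pb('<PB REF="00000009.tif" SEQ="00000009" RES="600dpi" '
              'FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii"/>')
```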
The following are examples of feature codes and their expanded definitions that have been used in various collections mounted by DLPS. You may redefine these or use other codes, but you will need to add or change the values in the PageView.cfg file found in the Text Class cgi directory. To see an example of these codes in use in the Text Class interface, go to [http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=moa;cc=moa;seq=00000019;idno=AGH8649.0001.001 this page from ''The Use of the Barometer''] in the Making of America and note the variety of features in the upper right hand pull-down menu labeled "go to." If UNS is the sole feature recorded, no special features will be listed in this menu. Note that the FTR value becomes part of a lookup key in langmap.en.xml to be referred to from the pageviewer XSL.

Here is an example:
<blockquote>

 <Lookup id="viewer">
 <Item key="viewer.ftr.1stpg">First Page</Item>
 <Item key="viewer.ftr.ack">Acknowledgement</Item>
 <Item key="viewer.ftr.adv">Advertisement</Item>
 <Item key="viewer.ftr.app">Appendix</Item>
 <Item key="viewer.ftr.bib">Bibliography</Item>
 <Item key="viewer.ftr.blp">Blank Page</Item>
 <Item key="viewer.ftr.ctp">Cover Title Page</Item>
 <Item key="viewer.ftr.dig">Digest</Item>
 <Item key="viewer.ftr.err">Errata</Item>
 <Item key="viewer.ftr.fnt">Front Matter</Item>
 <Item key="viewer.ftr.his">History</Item>
 <Item key="viewer.ftr.ind">Comprehensive Index</Item>
 <Item key="viewer.ftr.loi">List of Illustrations</Item>
 <Item key="viewer.ftr.lot">List of Tables</Item>
 <Item key="viewer.ftr.map">Map</Item>
 <Item key="viewer.ftr.mis">Miscellaneous</Item>
 <Item key="viewer.ftr.mss">Manuscript</Item>
 <Item key="viewer.ftr.not">Notes</Item>
 <Item key="viewer.ftr.npn">[n/a]</Item>
 <Item key="viewer.ftr.ord">Ordinances</Item>
 <Item key="viewer.ftr.pni">Author or Name Index</Item>
 <Item key="viewer.ftr.pnt">Production Note</Item>
 <Item key="viewer.ftr.pre">Preface</Item>
 <Item key="viewer.ftr.ref">References</Item>
 <Item key="viewer.ftr.reg">Regulations</Item>
 <Item key="viewer.ftr.rul">Rules</Item>
 <Item key="viewer.ftr.spi">Special Index</Item>
 <Item key="viewer.ftr.sui">Subject Index</Item>
 <Item key="viewer.ftr.sup">Supplement</Item>
 <Item key="viewer.ftr.tab">Table</Item>
 <Item key="viewer.ftr.toc">Table of Contents</Item>
 <Item key="viewer.ftr.tpg">Title Page</Item>
 <Item key="viewer.ftr.uns"></Item>
 <Item key="viewer.ftr.ves">Volume End Sheets</Item>
 <Item key="viewer.ftr.vli">Volume List of Illus</Item>
 <Item key="viewer.ftr.voi">Volume Index</Item>
 <Item key="viewer.ftr.vpg">Various Pagination</Item>
 <Item key="viewer.ftr.vtp">Volume Title Page</Item>
 <Item key="viewer.ftr.vtv">Volume Title Page Verso</Item>
 <Item key="viewer.nopagenum">[n/a]</Item>
 </Lookup>

</blockquote>
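The langmap keys above appear to be derived mechanically from the FTR codes. A one-line sketch of that mapping (an assumption inferred from the key pattern shown, not DLXS source):

```python
# Sketch of the FTR-to-langmap-key mapping implied by the example above
# (an inference from the key pattern, not DLXS source): the FTR code,
# lowercased, is appended to the "viewer.ftr." prefix.
def langmap_key(ftr: str) -> str:
    return "viewer.ftr." + ftr.lower()
```

So a page tagged FTR="TPG" resolves to the "viewer.ftr.tpg" entry ("Title Page") for the "go to" menu.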
= Populating the Pageview Table =

In DLXS releases prior to CD-ROM 8, pageview.dat, a tab-delimited ASCII file used to locate the page images associated with a text, was stored in the directory with the page images for a particular collection. If you have created pageview.dat files and would like to migrate them to the Pageview table, instructions can be found in the DLXS documentation (data/pageview.html). Otherwise, metadata about page images for a collection should be entered directly into the Pageview table.

= Creating pageview.dat Files (For Information Only) =

The pageview.dat file for a particular XML text can be automatically generated provided that the required metadata is stored as attributes in the page break (PB) elements in the text. On the distribution CD-ROM, in the directory /l1/bin/t/text/, you will find a perl script named makepageviewdata.pl. When run with a directory path as its sole argument, it will work through the subdirectories, creating pageview.dat files for all files with a .sgm* extension. (For XML files, you will need to edit lines 27 and 51 to point the script to files with the extension .xml.) For example,
 /l1/bin/t/text/makepageviewdata.pl /l1/obj/a/

will run through '''all''' the subdirectories below /l1/obj/a/ and report on the files it finds and the work it is doing:

 Working on xml files in directory: /l1/obj/a/j/l/ajl7777.0001.001
 Working on file: /l1/obj/a/j/l/ajl7777.0001.001/ajl7777.0001.001.xml
 Working on PB tag for sequence: 0001
 Working on PB tag for sequence: 0002
 Working on PB tag for sequence: 0003
 Working on PB tag for sequence: 0004
Working with a document containing these four page break tags:

 <PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="852" N="1"/>
 <PB REF="00000002.tif" SEQ="00000002" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="100" N="2"/>
 <PB REF="00000003.tif" SEQ="00000003" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="884" N="3"/>
 <PB REF="00000004.tif" SEQ="00000004" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="872" N="4"/>

would result in a pageview.dat file that contained this data:

 <nowiki>## File: /l1/obj/b/a/b/bab3633.0001.001/pageview.dat
 ## Created: Mon Aug 6 11:32:55 EDT 2001
 ##
 #filename seq pagenum confid feature
 00000001.tif 00000001 00000001 852 TPG
 00000002.tif 00000002 00000002 100 UNS
 00000003.tif 00000003 00000003 884 UNS
 00000004.tif 00000004 00000004 872 UNS</nowiki>
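The transformation from PB tags to pageview.dat rows is mechanical. This is a hypothetical re-sketch of makepageviewdata.pl's core step (not the actual Perl script; the UNSPEC-to-UNS normalization and the use of the zero-padded sequence in the pagenum column are both inferred from the sample output above):

```python
import re

# Hypothetical sketch of makepageviewdata.pl's core step (not the real
# script): turn each <PB> tag into a tab-delimited row of filename, seq,
# pagenum, confid, feature. The sample output above records the
# zero-padded sequence in the pagenum column and FTR="UNSPEC" as UNS,
# so feature codes are truncated to three letters here.
PB = re.compile(r"<PB\s+([^>]*?)/?>")
ATTR = re.compile(r'(\w+)="([^"]*)"')

def pageview_rows(xml_text: str):
    rows = []
    for tag in PB.findall(xml_text):
        a = dict(ATTR.findall(tag))
        feature = (a.get("FTR") or "UNS")[:3]
        rows.append("\t".join([a["REF"], a["SEQ"], a["SEQ"],
                               a.get("CNF", ""), feature]))
    return rows
```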
= Linking from Pageviewer to ImageClass =

Pageviewer can handle both bitonal and continuous tone image formats. The output of a page capture process might be a bitonal scan of a page containing a continuous tone plate plus a second, continuous tone scan of the plate; or the entire page might be captured as a continuous tone image. So there are two cases.

* In the bitonal-with-continuous-plate case, Pageviewer can display the bitonal page image and provide a link into ImageClass for the continuous tone plate. The link may be simple text or a thumbnail of the continuous tone plate (if thumbnail images are available).
* If the entire page is a continuous tone image, it may still be desirable to link to ImageClass for the enhanced viewing opportunities provided there.

Pageviewer can provide links to the image if it is placed under ImageClass management. The bitonal page with thumbnail link is shown in the following illustration.

<center>[[Image:pageviewerlink.png]]</center>

== Detailed Description of the Mechanism ==
=== ImageClass ===

The continuous tone images and optional thumb images are placed under ImageClass management as described in Image Class Image Loading for Online Delivery.

=== DLXS idresolver CGI and the idresolver database table ===

The DLXS IdResolver CGI script is delivered in <tt>DLXSROOT/cgi/i/idresolver/idresolver</tt>. It depends on a table in the DLXS database called <tt>idresolver</tt>. Check <tt>DLXSROOT/misc/db/MySQL-dump.sql</tt> for the schema. A row consists of an ID and a URL to ImageClass for that ID.

The default format of the ID is <tt>ic-''collid-idno''-''ENTITY''-[th|rec]</tt>, where '''th''' is for the thumb and '''rec''' is for the full-sized image. '''ENTITY''' is the value of the '''ENTITY''' attribute in the <FIGURE> tag in the XML. Simple bitonal page collections use the <PB> tag to indicate pages. If the page has a continuous tone plate or is itself a continuous tone image for which ImageClass viewing is desirable, the <FIGURE> tag is used in addition to the <PB> tag.
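Following the default ID format above, the pair of resolver IDs for a figure can be sketched as (a hypothetical helper with made-up example values, not DLXS code):

```python
# Hypothetical helper (not DLXS code) building the two idresolver IDs
# for a figure, following the default format ic-collid-idno-ENTITY-[th|rec].
# The collid/idno/entity values in the usage below are made-up examples.
def resolver_ids(collid: str, idno: str, entity: str):
    base = "ic-%s-%s-%s" % (collid, idno, entity)
    return base + "-th", base + "-rec"

thumb_id, fullsize_id = resolver_ids("moa", "AGH8649.0001.001", "FIG001")
```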
The virtual host running the idresolver CGI is assigned to <tt>$LibGlobals::gIdResolverHost</tt> via an install-time question. It can be changed later by editing DLXSROOT/lib/LibGlobals.cfg.

Population of the <tt>idresolver</tt> table is accomplished by a variety of local mechanisms, which will vary depending upon your data preparation processes.

=== DLXS database PageviewFig table ===

The final component of this mechanism is the DLXS database <tt>PageviewFig</tt> table.
| |
| | | |
This table is consulted by Pageviewer to determine whether to present links to ImageClass when a given page is displayed. Check <tt>DLXSROOT/misc/db/MySQL-dump.sql</tt> for the schema. A row consists of (1) the <tt>idno</tt> of the item, (2) the page sequence number (<tt>seq</tt>) from the '''SEQ''' attribute of the <PB> tag for the page being viewed, and (3) the value of the '''ENTITY''' attribute from the <FIGURE> tag accompanying the <PB> tag in the XML.

The <tt>PageviewFig</tt> table can be populated by the <tt>DLXSROOT/bin/t/text/processPageviewFig.pl</tt> Perl script. It accepts a single command-line option with values of '''-a''' to insert rows or '''-d''' to delete rows. The script will prompt for DLXSROOT and for a path, relative to DLXSROOT, where the XPAT Data Dictionary for the collection resides, e.g. <tt>/idx/c/coll/coll.dd</tt>. The script performs XPAT queries and parses the '''ENTITY''' attribute value from the <FIGURE> tag, the '''SEQ''' attribute of the containing <PB> tag, and the <tt>idno</tt> of the containing item. The result is inserted into the table. Deletion is accomplished by deleting all rows with <tt>idno</tt> values for the collection in question.

=== Putting it all together ===

When Pageviewer displays the image for a given page ('''seq''') of a given item ('''idno''') in some collection, it uses these keys to look up the corresponding '''ENTITY''' value in the <tt>PageviewFig</tt> table. It constructs two IDs from the '''ENTITY''' for lookup in the <tt>idresolver</tt> table. It then builds links using the URLs to ImageClass that come back in the query result. If the thumbnail image URL is empty, it builds a text link instead of a thumb link.

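The whole lookup chain can be sketched end to end as below. This is not the actual Pageviewer code (which is Perl); it is a hypothetical Python illustration with assumed table column names, again using an in-memory SQLite database in place of the DLXS MySQL database, and hypothetical collection, item, entity, and URL values.

```python
import sqlite3

# Stand-in tables mirroring the descriptions above (column names assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PageviewFig (idno TEXT, seq TEXT, entity TEXT)")
conn.execute("CREATE TABLE idresolver (id TEXT, url TEXT)")
conn.execute(
    "INSERT INTO PageviewFig VALUES ('abc1234.0001.001', '00000005', 'E5')")
conn.execute(
    "INSERT INTO idresolver VALUES "
    "('ic-moa-abc1234.0001.001-E5-rec', 'http://localhost/full')")
# Note: no thumb ("-th") row, so the sketch falls back to no thumb URL,
# mirroring Pageviewer's text-link-instead-of-thumb-link behavior.

def figure_links(collid, idno, seq):
    """Resolve a page's FIGURE entity to its thumb/full ImageClass URLs."""
    row = conn.execute(
        "SELECT entity FROM PageviewFig WHERE idno=? AND seq=?",
        (idno, seq)).fetchone()
    if row is None:
        return None  # no FIGURE on this page; no ImageClass links at all
    entity = row[0]
    links = {}
    for suffix in ("th", "rec"):
        key = "ic-%s-%s-%s-%s" % (collid, idno, entity, suffix)
        hit = conn.execute(
            "SELECT url FROM idresolver WHERE id=?", (key,)).fetchone()
        links[suffix] = hit[0] if hit else None
    return links

links = figure_links("moa", "abc1234.0001.001", "00000005")
# links["rec"] -> 'http://localhost/full'
# links["th"]  -> None  (Pageviewer would render a text link instead)
```

The two-step design (page-to-entity via <tt>PageviewFig</tt>, then entity-to-URL via <tt>idresolver</tt>) keeps the page-image delivery decoupled from wherever ImageClass actually serves the continuous tone images.
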
==[[Working with the Text Class User Interface]]==
===[[Specifying Individual Collection Characteristics]]===
===[[Text Class XML Template Files]]===
===[[Text Class Processing Instructions]]===
===[[Text Class Graphics Files]]===

==[[Creating Text Class Wordwheels]]==

==[[Newspaper Clip Image Access Mechanisms]]==

==[[Workflows for delivering PDF files as page images through pageturner]]==

==[[Picklist customization for serial articles]]==
<span class="release_15">'''Release_15'''</span>

==[[Text Class Troubleshooting]]==
This section should include Text Class problems and solutions; user-contributed material is welcome!

[[#top|Top]]