Working with Text Class Markup

From DLXS Documentation

Current revision (15:28, 15 October 2009)

Working with the Text Class DTD

This document lays out the elements and attributes required in all Text Class texts. It assumes that any text to be converted to Text Class already validates against another DTD for encoding monographic materials, such as TEI Lite, that represents corresponding structures (chapters, pages, etc.). Because its content modelling is extremely lax (almost every element is defined to permit ANY of the named elements), the Text Class DTD is useful only for confirming that element names have been changed appropriately.

Basic Requirements

The following elements and attributes are required:

  • DLPSTEXTCLASS
  • HEADER
  • EDITORIALDECL with an N attribute set according to the TEI in Libraries recommendations for encoding levels (http://www.indiana.edu/~letrs/tei/). The valid values are "1", "2", and "4".
  • IDNO -- must be unique for each text in a collection, and should not contain spaces
  • TEXT
  • BODY
  • Nested numbered DIVs, with at least one DIV1
    • Each DIV must have a NODE attribute; a TYPE attribute is recommended
    • All content, such as pagebreaks, should be inside DIVs, not between them, if at all possible
  • Nested numbered elements in place of elements that contain themselves
    • For example, NOTE, Q, and HI become NOTE1, Q1, and HI1, which contain NOTE2, etc.
  • NOTE1 and PTR must have N and ID attributes
    • The default value for N is *; IDs are assigned automatically to elements that lack them

See the following example document:

 <DLPSTEXTCLASS>
   <HEADER>
     <FILEDESC>
       <TITLESTMT>
         <TITLE TYPE="245">Some Book</TITLE>
         <AUTHOR>Some Author</AUTHOR>
       </TITLESTMT>
       <EXTENT>1 600dpi TIFF G4 page image</EXTENT>
       <PUBLICATIONSTMT>
         <PUBLISHER>Humanities Text Initiative</PUBLISHER>
         <PUBPLACE>University of Michigan</PUBPLACE>
         <IDNO TYPE="dlps">acw8433.1-01.001</IDNO>
         <AVAILABILITY>
           <P>These pages may be freely searched and displayed. Permission must be received for subsequent distribution in print or electronically. Please contact umdl-info@umich.edu for more information.</P>
         </AVAILABILITY>
       </PUBLICATIONSTMT>
       <SOURCEDESC>
         <BIBL>
           <TITLE TYPE="main">Some Book</TITLE>
           <AUTHOR>Some Author</AUTHOR>
           <PUBLISHER>D. Appleton and Company.</PUBLISHER>
           <PUBPLACE>New York,</PUBPLACE>
           <DATE>Apr 3, 1869</DATE>
         </BIBL>
       </SOURCEDESC>
     </FILEDESC>
     <ENCODINGDESC>
       <PROJECTDESC>
         <P>Revised with script moajrnltotextclass.pl on 2001-11-21.</P>
       </PROJECTDESC>
       <EDITORIALDECL N="1">
         <P>This electronic text file was created by Optical Character Recognition (OCR). No corrections have been made to the OCR-ed text and no editing has been done to the content of the original document. Encoding has been done through an automated process using the recommendations for Level 1 of the TEI in Libraries Guidelines. Digital page images are linked to the text file.</P>
       </EDITORIALDECL>
     </ENCODINGDESC>
     <PROFILEDESC>
       <TEXTCLASS>
         <KEYWORDS>
           <TERM>Equations, Abelian.</TERM>
         </KEYWORDS>
       </TEXTCLASS>
     </PROFILEDESC>
   </HEADER>
   <TEXT>
     <BODY>
       <DIV1 NODE="acw8433.1-01.001:1">
         <P>
           <PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="874" N=""/>asdfqwerwerwer <NOTE1 N="*" ID="note0001">see asdfasdft</NOTE1></P>
       </DIV1>
     </BODY>
   </TEXT>
 </DLPSTEXTCLASS>
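
The IDNO requirement (unique within a collection, no spaces) is easy to spot-check before indexing. The following is only a rough sketch, not part of DLXS: check_idnos is a hypothetical helper, and it assumes each text carries its IDNO on a single line, as in the example document.

```shell
# check_idnos FILE...
# Hypothetical helper (not a DLXS tool): reports IDNO values that contain
# whitespace or that occur more than once across the given files.
# Assumes each <IDNO ...>...</IDNO> element sits on a single line.
check_idnos() {
    # Extract the text content of every one-line IDNO element.
    ids=$(sed -n 's|.*<IDNO[^>]*>\([^<]*\)</IDNO>.*|\1|p' "$@")
    status=0

    # IDNO values should not contain spaces.
    if printf '%s\n' "$ids" | grep -q '[[:space:]]'; then
        echo "IDNO containing whitespace found" >&2
        status=1
    fi

    # IDNO values must be unique for each text in the collection.
    dups=$(printf '%s\n' "$ids" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate IDNO: $dups" >&2
        status=1
    fi

    return "$status"
}
```

Calling check_idnos with the paths of the collection's texts returns 0 when every extracted IDNO is unique and free of whitespace, and 1 otherwise.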

Build the Collection-Specific Text Class DTD

Before indexing your collection, you will need to create a collection-specific Text Class DTD. Because the class DTD allows any element to carry any of the declared attributes (for convenience of DTD creation), indexing "as-is" with XPATu will make your index unnecessarily large. It may also cause problems for XML validation tools. You can create your own collection-specific version of the Text Class DTD by running the following command:

 egrep -i "<\!ELEMENT" $DLXSROOT/misc/sgml/textclass.xml.dtd > textclass.stripped.xml.dtd

Next, use the "make dtd" target in the Makefile to determine which attributes are used in your collection and build a custom DTD by concatenating their declarations with $DLXSROOT/misc/xml/textclass.stripped.xml.dtd. The "make validate" target then validates your collection against the new DTD. If the individual texts validated before, they should validate as a concatenated collection now.

 cd $DLXSROOT/bin/w/workshoptc
 make dtd
 make validate
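
To see what the egrep filter in the first command keeps, here is a small illustration on a made-up DTD fragment. The fragment is invented for this sketch and is not taken from textclass.xml.dtd; the sketch also assumes that each element declaration fits on one line, since the filter works line by line.

```shell
# Made-up DTD fragment, for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<!ELEMENT DIV1 ANY>
<!ATTLIST DIV1 NODE CDATA #IMPLIED>
<!ELEMENT P ANY>
<!ATTLIST P ID ID #IMPLIED>
EOF

# Keep only the lines that declare elements; the attribute lists drop out.
stripped=$(grep -E -i '<!ELEMENT' "$sample")
printf '%s\n' "$stripped"
# prints:
# <!ELEMENT DIV1 ANY>
# <!ELEMENT P ANY>

rm -f "$sample"
```

The stripped file therefore contains element declarations only, which leaves the attribute declarations to be supplied separately for the attributes a collection actually uses.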

FIGURE Resolution

The FIGURE Element

The FIGURE element is used in TextClass markup to encode the occurrence of a figure in a text document. The FIGURE element's ENTITY attribute carries the ID that the middleware resolves to an image file on disk or to an image under management by ImageClass.

If the document is TEI Level 1 (a page-image document viewable in Pageviewer), the FIGURE tag supplies additional data about the bitonal page image beyond that supplied by the PB tag. The additional data might be a second, continuous-tone scan of the entire page or a continuous-tone scan of a plate within the page. The middleware makes the continuous-tone (contone) images viewable under ImageClass via links built into the full OCR text of the document (where that OCR is displayable) or via additional links in Pageviewer (pageviewer.html#pageviewFig). For higher TEI levels, the FIGURE tag typically marks the occurrence of an inline figure, whose image is usually on disk but may also be retrievable from ImageClass through the IdResolver mechanism described in the IdResolver section of the Pageviewer documentation (pageviewer.html#DLXS_IDResolver).

The balance of this document describes how the FIGURE element is resolved into an image via two mechanisms: Filesystem resolution and IdResolver resolution. Also described is how the resolution can be affected by writing subclasses of TextClass.

FIGURE Resolution in General

The resolution mechanism is summarized in the following diagram:

 <FIGURE ENTITY="abc"> --> transform "abc" to key --> look up key --> URL or file system path

The document markup is parsed and the ENTITY attribute value of a given FIGURE tag is extracted. The attribute value is transformed into a key/path suitable for lookup via the DLXS IdResolver or by looking on the disk. If IdResolver is used, the corresponding ImageClass URL is returned. If the filesystem is used, the path to the file in the web space is determined. This URL/path becomes the value of a new attribute which is added to the FIGURE tag and passed along for eventual processing by the XSL stylesheet (principally text.components.xsl). The XSL stylesheet typically transforms the FIGURE tag together with its attributes into an HTML img tag, possibly wrapped with an HTML anchor (a) tag, if a popup window or link into ImageClass is required to view the image of the figure.

More precisely, depending on the configuration, the middleware adds two or three new attributes to the FIGURE tag for XSL processing: FIGTYPE, HREF_1 and HREF_2. The collection-specific configuration referred to below is discussed in the Default Behavior and Custom Configuration section later in this document.

  1. FIGTYPE carries one of two values:
    • INLINE when the configuration is for figures that appear inline with the text. In this case, only the HREF_1 attribute will appear.
    • THUMB when the configuration is for figures that have thumbnail images that appear inline with the text. In this case, both of the HREF_* attributes will appear.
  2. HREF_1 is the URL or web space path to the inline image when FIGTYPE=INLINE, or to a thumbnail of the full sized image when FIGTYPE=THUMB.
  3. HREF_2 is the URL or web space path to the full sized version of the figure image when FIGTYPE=THUMB.
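
The attribute rules above can be sketched as a small shell function. This is an illustration only, not the middleware's code: figure_attrs and its arguments are hypothetical, and the first argument stands in for whatever the collection's thumbnail configuration reports.

```shell
# figure_attrs USES_THUMBS HREF_1 [HREF_2]
# Hypothetical helper: prints the attributes that would be added to a
# FIGURE tag, following the INLINE/THUMB rules described above.
figure_attrs() {
    uses_thumbs=$1
    href_1=$2
    href_2=$3
    if [ "$uses_thumbs" = 1 ]; then
        # Thumbnail inline, linked to the full sized image: both HREFs appear.
        printf 'FIGTYPE="THUMB" HREF_1="%s" HREF_2="%s"\n' "$href_1" "$href_2"
    else
        # Plain inline figure: only HREF_1 appears.
        printf 'FIGTYPE="INLINE" HREF_1="%s"\n' "$href_1"
    fi
}

figure_attrs 0 /c/coll/images/fig001.gif
# prints: FIGTYPE="INLINE" HREF_1="/c/coll/images/fig001.gif"
```

The paths in the usage line are placeholders; in the real system HREF_1 and HREF_2 come from the filesystem or IdResolver lookup.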

Note: The debug=resolver URL parameter can be added to the end of the URL to see the action of the resolver as it operates on the ENTITY attribute of the FIGURE tag.

The IdResolver Resolution Mechanism

If the configuration indicates that figure images are "external", i.e. that they are managed in ImageClass or by a third-party host, the IdResolver is used to resolve the ID to an ImageClass or third-party URL. The IdResolver mechanism is described in the IdResolver section of the Pageviewer documentation (pageviewer.html#DLXS_IDResolver).

The Filesystem Resolution Mechanism

If the configuration indicates that figure images are not "external", the figure images should be found as files in the web space. The middleware constructs a default path to the image in the web space as /webdir/images/ENTITY.extension, where webdir is the collmgr value for the collection and extension comes from a list (.gif, .jpg, etc.). The middleware tests for file existence in the web space for each extension in turn until a hit occurs. This allows files of several different formats to coexist in the web space.

The DLXS directory convention is to store these image files in DLXSROOT/img/c/coll and make a symbolic link to that directory in DLXSROOT/web/c/coll called images.
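
The extension probing described above can be sketched in shell as follows. This is a rough illustration rather than the middleware's implementation: resolve_figure_path is a hypothetical name, the first argument is an assumed document root for the web space, and the extension list is shortened to the two formats named above (the real list lives in TextClass.pm).

```shell
# resolve_figure_path DOCROOT WEBDIR ENTITY
# Hypothetical helper: prints the first existing web space path of the
# form WEBDIR/images/ENTITY.ext and returns 1 if no candidate exists.
resolve_figure_path() {
    docroot=$1   # filesystem directory the web space is served from (assumed)
    webdir=$2    # collection's web directory, e.g. /c/coll
    entity=$3    # value of the FIGURE tag's ENTITY attribute
    for ext in gif jpg; do   # illustrative subset of the extension list
        candidate="$webdir/images/$entity.$ext"
        if [ -f "$docroot$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Because the loop stops at the first hit, the order of the extension list decides which file wins when an entity exists in more than one format.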

Default Behavior and Custom Configuration

The default configuration of the figure resolution mechanism assumes that all figures are inline, on disk, and without corresponding thumbnail images. Within this constraint, it is possible to change how the path to the disk file is generated, so that a number of file naming conventions can be derived from the bare ENTITY attribute value. This is described below.

Modifying the described configuration of the figure resolution mechanism is accomplished by writing a subclass of the TextClass package for each affected collection. The size of the methods that need to be written is small, typically just a line or two of code.

Following is a synopsis of the methods provided for subclassing. Please consult the code in DLXSROOT/cgi/t/text/TextClass.pm and its subclasses for more detail.

  • ExternalFigFullImage by default returns 0, indicating that figure images are on disk. Override this and return 1 to indicate that images should be resolved into ImageClass or third-party URLs by IdResolver.
  • ExternalFigThumb by default returns 0, indicating that thumbnail images of figures are on disk. Override this and return 1 to indicate that thumbnail images should be resolved into ImageClass or third-party URLs by IdResolver.
  • CollUsesThumbs by default returns 0. Override this and return 1 to indicate that figures should be rendered as inline thumbnails linked to a full sized version of the image. Note that thumbnails and their full sized counterparts can reside under ImageClass or be accessed directly from the disk.
  • GenerateExternalId returns the default format for the IdResolver ID (key) as ic-collid-idno-ENTITY-[th|rec], where th is for the thumbnail and rec is for the full sized image. ENTITY is the value of the ENTITY attribute in the <FIGURE> tag in the XML. By overriding this method, different ID schemes can be implemented.
  • GenerateFilesysPathElement by default simply returns the element for concatenation as described above. In the case where CollUsesThumbs is true and thumbnails and full sized images are on disk, this method can be overridden to supply different names for the thumbnail and full sized image files.
  • FigureUriHook by default returns the URL passed in. For collections where the URL of the image needs to be wrapped in a JavaScript call, this method can be overridden.
  • SyntheticAttrsHook can be overridden to add further attributes to the FIGURE tag for specialized processing by the XSL stylesheets, for collection-specific customization.
  • FilesysIdResolver can be overridden to supply a custom list of file extensions to check if the base class default list is not inclusive enough. In the subclass implementation, the list is passed as the optional last parameter to an invocation of the base class method, i.e. as SUPER::FilesysIdResolver.
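
As an illustration of the default IdResolver key format described for GenerateExternalId, the following shell sketch assembles a key from its parts. It is not the Perl method itself: generate_external_id is a hypothetical name, and the sketch assumes that "ic" is a literal prefix and that collid and idno are the collection ID and the text's IDNO.

```shell
# generate_external_id COLLID IDNO ENTITY KIND
# Hypothetical helper: builds a key of the form ic-collid-idno-ENTITY-KIND,
# where KIND is th (thumbnail) or rec (full sized image).
generate_external_id() {
    case $4 in
        th|rec) ;;                                   # the two suffixes named above
        *) echo "KIND must be th or rec" >&2; return 1 ;;
    esac
    printf 'ic-%s-%s-%s-%s\n' "$1" "$2" "$3" "$4"
}

generate_external_id coll acw8433.1-01.001 fig001 th
# prints: ic-coll-acw8433.1-01.001-fig001-th
```

The collection ID and entity name in the usage line are made up; only the IDNO is borrowed from the example document.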
