Mounting a Text Class Collection

From DLXS Documentation

(Difference between revisions)

Current revision

1 Text Class Overview
2 Examples of Text Class Implementations and Practices
3 Working with Text Class Markup
4 Working with Text Class Data and Directories
5 Data Preparation and Conversion: Unicode, XML, and Normalization
6 Working with Fabricated Regions
7 Building the Text Class Index
8 Mounting the Text Class Collection Online
9 Adding Text Class Search Functionality
10 Working with Page Image Access Mechanisms in Text Class
11 Working with the Text Class User Interface
12 Creating Text Class Wordwheels
13 Newspaper Clip Image Access Mechanisms
14 Workflows for delivering PDF files as page images through pageturner
15 Picklist customization for serial articles
16 Text Class Troubleshooting

Text Class Overview

Examples of Text Class Implementations and Practices

Examples include Making of America and Twentieth Century American Poetry.

The decision process for inclusion of content in the Text Class is roughly based on the following:

The text is a monograph or journal issue available in SGML or XML.
Each text has a unique ID.
The campus and/or world community would benefit from access to the information.

These are, at best, general guidelines for decision making based on the current state of the Text Class implementation at the University of Michigan.

Working with Text Class Markup

This section describes the Text Class DTD and the mechanisms and programs that DLXS uses to access and display images defined by the FIGURE element in TextClass document markup.

Working with Text Class Data and Directories

Data Preparation and Conversion: Unicode, XML, and Normalization

Working with Fabricated Regions

Building the Text Class Index

Mounting the Text Class Collection Online

Adding Text Class Search Functionality

Working with Page Image Access Mechanisms in Text Class

Working with the Text Class User Interface

Creating Text Class Wordwheels

Newspaper Clip Image Access Mechanisms

Workflows for delivering PDF files as page images through pageturner

Picklist customization for serial articles

Release_15

Text Class Troubleshooting

This section should include Text Class problems and solutions--user-contributed material is welcome!

@@ Line 3: / Line 3: @@
 ==[[Text Class Overview]]==
-==[[Examples of Text Class Implementations and Practices]]==
+==Examples of Text Class Implementations and Practices==
 [http://moa.umdl.umich.edu Making of America], [http://qoud.lib.umich.edu/a/ampo20/ Twentieth Century American Poetry].
@@ Line 23: / Line 23: @@
 ==[[Data Preparation and Conversion: Unicode, XML, and Normalization]]==
 ==[[Working with Fabricated Regions]]==
-==[[Customizing Text Class]]==
-This could provide other topics, but I think it would be a good place for a discussion of the pieces of text class that are customizable ranked by level of difficulty. Then you could link to the places in the doc that discuss specific methods for customizing different bits.
 ==[[Building the Text Class Index]]==
-==[[Mounting the Collection Online]]==
+==[[Mounting the Text Class Collection Online]]==
-Text Class Collection to Web from workshop doc would go in here
+==[[Adding Text Class Search Functionality]]==
 ==[[Working with Page Image Access Mechanisms in Text Class]]==
-This document describes the mechanisms and programs used by DLXS for accessing and viewing images of pages that correspond to pages in TextClass documents (and possibly other classes). It also explains the particular metadata requirements that exist for this functionality to be possible and shows a sample pageview.dat file. The pageview.dat mechanisms are now deprecated, but still useful for importing information into the Pageview table (see instructions below in Populating the Pageview Table). Also discussed is the mechanism for linking to ImageClass to view Pageviewer images in more detail.
-* [#GeneralInfo General Information]
-** [ivimageconversion.html Itemviewer Image Conversion]
-* [#pb The Page Break (PB) Element and Page Image Metadata]
-* [#pageviewTable Populating the Pageview Table]
-* [#pageviewDat Creating pageview.dat Files]
-* [#pageviewFig Linking from Pageviewer to ImageClass]
-= General Information =
-For collections where the middleware delivers page images instead of, or in addition to, the text content of the pages, the main mechanism for viewing the pages is the pageviewer-idx CGI program. In order to link from the XML text to the corresponding image, this CGI expects that page images are stored in directories based on (1) the DLXSROOT value, (2) the object directory recorded in the collection manager, and (3) the unique identifier assigned to the XML text and stored in the IDNO element, and that there are page break elements in the document referencing the images. In addition, there must be a Pageview table in the [../../collmeta/colldatabases.html dlxs metadata database] that contains a row for each page image, listing the image file name, its sequence in the XML text, the page number (if any) specified on the page, the OCR confidence value (if available), and a three-letter code for any special features of the page (the default value for no special feature is UNS; see below for more information).
-<span class="command">pageviewer-idx</span> connects to the database and retrieves the name and location of the page image file from the Pageview table. <span class="command">pageviewer-idx</span> then decides how to deliver the page. If the stored file format is different from the requested format as recorded in the collection manager (e.g., stored as tiff and requested as gif), a separate program, tif2web, is started to convert the image on the fly. ''For more information about how pageviewer-idx does its work in deciding how to deliver the page image, see [ivimageconversion.html Itemviewer Image Conversion ]. ''
-= The Page Break Element and Page Image Metadata =
-The PB tag in the XML data, representing a page break, has this form in Text Class:
- &lt;PB REF="00000009.tif" SEQ="00000009" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="856" N="iiii"/&gt;
-The attributes are:
-* REF: file name of page image
-* SEQ: the sequence number of the page in the series, from start to finish, of all the pages in the document.
-* RES: the resolution of the page image.
-* FMT: the file format of the page image.
-* FTR: the feature of the page, given as a three letter code. Possible values are listed below.
-* CNF: the confidence value of the OCR for the page, given by the OCR software.
-* N: the page number, not as a sequence, but rather the number as printed on the page (e.g., 3, 96, ix, etc.). This may be left blank, but the attribute cannot be omitted.
-The information in this PB tag allows the Text Class middleware to create a URL to call the pageviewer-idx program with the parameters necessary to retrieve and display the corresponding page image. pageviewer-idx uses the Pageview table of the dlxs metadata database to do so.
-The following are examples of feature codes and their expanded definitions that have been used in various collections mounted by DLPS. You may redefine these or use other codes, but will need to add or change the values in the PageView.cfg file found in the Text Class cgi directory. To see an example of these codes in use in the Text Class interface, go to [http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?c=moa;cc=moa;seq=00000019;idno=AGH8649.0001.001 this page from ''The Use of the Barometer''] in the Making of America and note the variety of features in the upper right-hand pull-down menu labeled "go to." If UNS is the sole feature recorded, no special features will be listed in this menu. Note that the FTR value becomes part of a lookup key in the langmap.en.xml to be referred to from the pageviewer XSL.
-Here is an example:
-<blockquote>
-         &lt;Lookup id="viewer"&gt;
-           &lt;Item key="viewer.ftr.1stpg"&gt;First Page&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.ack"&gt;Acknowledgement&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.adv"&gt;Advertisement&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.app"&gt;Appendix&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.bib"&gt;Bibliography&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.blp"&gt;Blank Page&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.ctp"&gt;Cover Title Page&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.dig"&gt;Digest&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.err"&gt;Errata&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.fnt"&gt;Front Matter&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.his"&gt;History&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.ind"&gt;Comprehensive Index&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.loi"&gt;List of Illustrations&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.lot"&gt;List of Tables&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.map"&gt;Map&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.mis"&gt;Miscellaneous&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.mss"&gt;Manuscript&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.not"&gt;Notes&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.npn"&gt;[n/a]&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.ord"&gt;Ordinances&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.pni"&gt;Author or Name Index&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.pnt"&gt;Production Note&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.pre"&gt;Preface&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.ref"&gt;References&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.reg"&gt;Regulations&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.rul"&gt;Rules&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.spi"&gt;Special Index&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.sui"&gt;Subject Index&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.sup"&gt;Supplement&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.tab"&gt;Table&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.toc"&gt;Table of Contents&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.tpg"&gt;Title Page&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.uns"&gt;&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.ves"&gt;Volume End Sheets&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.vli"&gt;Volume List of Illus&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.voi"&gt;Volume Index&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.vpg"&gt;Various Pagination&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.vtp"&gt;Volume Title Page&lt;/Item&gt;
-           &lt;Item key="viewer.ftr.vtv"&gt;Volume Title Page Verso&lt;/Item&gt;
-           &lt;Item key="viewer.nopagenum"&gt;[n/a]&lt;/Item&gt;
-         &lt;/Lookup&gt;
-</blockquote>
-= Populating the Pageview Table =
-In DLXS releases prior to CD-ROM 8, the pageview.dat, a tab-delimited ASCII file used to locate page images associated with a text, was stored in the directory with the page images for a particular collection. If you have created pageview.dat files and would like to migrate them to the Pageview table, [../../data/pageview.html instructions can be found here.] Otherwise, metadata about page images for a collection should be entered directly into the Pageview table.
-= Creating pageview.dat Files (For Information Only) =
-The pageview.dat file for a particular XML text can be automatically generated provided that the metadata required is stored as attributes in the page break (PB) elements in the text. On the distribution CD-ROM, in the directory /l1/bin/t/text/, you will find a perl script named makepageviewdata.pl. When run with a directory path as its sole argument, it will work through the subdirectories, creating pageview.dat files for all files with a .sgm* extension. (For XML files, you will need to edit lines 27 and 51 to point the script to files with the extension .xml.) For example,
- /l1/bin/t/text/makepageviewdata.pl /l1/obj/a/
-will run through '''all''' the subdirectories below /l1/obj/a/ and report on the files it finds and work it is doing:
- Working on xml files in directory: /l1/obj/a/j/l/ajl7777.0001.001
- Working on file: /l1/obj/a/j/l/ajl7777.0001.001/ajl7777.0001.001.xml
- Working on PB tag for sequence: 0001
- Working on PB tag for sequence: 0002
- Working on PB tag for sequence: 0003
- Working on PB tag for sequence: 0004
-Working with a document containing these four page break tags:
- &lt;PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" FTR="TPG" CNF="852" N="1"/&gt;
- &lt;PB REF="00000002.tif" SEQ="00000002" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="100" N="2"/&gt;
- &lt;PB REF="00000003.tif" SEQ="00000003" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="884" N="3"/&gt;
- &lt;PB REF="00000004.tif" SEQ="00000004" RES="600dpi" FMT="TIFF6.0" FTR="UNSPEC" CNF="872" N="4"/&gt;
-would result in a pageview.dat file that contained this data:
- <nowiki>## File:        /l1/obj/b/a/b/bab3633.0001.001/pageview.dat
- ## Created:     Mon Aug  6 11:32:55 EDT 2001
- ##
- #filename       seq       pagenum confid  feature
- 00000001.tif    00000001        00000001        852     TPG
- 00000002.tif    00000002        00000002        100     UNS
- 00000003.tif    00000003        00000003        884     UNS
- 00000004.tif    00000004        00000004        872     UNS</nowiki>
-= Linking from Pageviewer to ImageClass =
-Pageviewer can handle both bitonal and continuous tone image formats. The output of a page capture process might be a bitonal scan of a page containing a continuous tone plate and a second continuous scan of the plate. Or the entire page might be captured as a continuous tone image. So there are two cases.
-* In the bitonal with continuous plate case, Pageviewer can display the bitonal page image and provide a link into ImageClass for the continuous tone plate. The link may be simple text or a thumbnail of the continuous tone plate (if thumbnail images are available).
-* If the entire page is a continuous tone image it may still be desirable to link to ImageClass for enhanced viewing opportunities provided there.
-Pageviewer can provide a link to the image if it is placed under ImageClass management. The bitonal page with thumbnail link is shown in the following illustration.
-<center>[[Image:pageviewerlink.png]]</center>
-== Detailed Description of the Mechanism ==
-=== ImageClass ===
-The continuous tone images and optional thumb images are placed under ImageClass management as described in [../image/imageloading.html Image Class Image Loading for Online Delivery].
-=== DLXS idresolver CGI and DLXS database idresolver database table ===
-The DLXS IdResolver CGI script is delivered in <tt>DLXSROOT/cgi/i/idresolver/idresolver</tt>. It depends on a table in the DLXS database called <tt>idresolver</tt>. Check <tt>DLXSROOT/misc/db/MySQL-dump.sql</tt> for the schema. A row consists of an ID and a URL to ImageClass for that ID.
-The default format of the ID is <tt>ic-''collid-idno''-''ENTITY''-[th|rec]</tt> where '''th''' is for the thumb and '''''rec''''' is for the full sized image. '''''ENTITY''''' is the value of the '''ENTITY''' attribute in the &lt;FIGURE&gt; tag in the XML. Simple bitonal page collections use the &lt;PB&gt; tag to indicate pages. If the page has a continuous tone plate or is itself a continuous tone image for which ImageClass viewing is desirable, the &lt;FIGURE&gt; tag is used in addition to the &lt;PB&gt; tag.
-The virtual host running the idresolver CGI is assigned to <tt>$LibGlobals::gIdResolverHost</tt> via an install-time question. It can be changed later by editing DLXSROOT/lib/LibGlobals.cfg.
-Population of the <tt>idresolver</tt> table is accomplished by a variety of local mechanisms which will vary depending upon your data preparation processes.
-=== DLXS database PageviewFig table ===
-The final component to this mechanism is the DLXS database <tt>PageviewFig</tt> table.
-This table is consulted by Pageviewer to determine whether to present links to ImageClass when a given page is displayed. Check <tt>DLXSROOT/misc/db/MySQL-dump.sql</tt> for the schema. A row consists of (1) the <tt>idno</tt> of the item, (2) the page sequence number (<tt>seq</tt>) from the '''SEQ''' attribute of the &lt;PB&gt; tag for the page being viewed and (3) the value of the '''ENTITY''' attribute from the &lt;FIGURE&gt; tag accompanying the &lt;PB&gt; tag in the XML.
+==[[Working with the Text Class User Interface]]==
-The <tt>PageviewFig</tt> table can be populated by the <tt>DLXSROOT/bin/t/text/processPageviewFig.pl</tt> Perl script. It accepts a single command-line option with values of '''-a''' to insert rows and '''-d''' to delete rows. The script will prompt for DLXSROOT and for a path relative to DLXSROOT where the XPAT Data Dictionary for the collection resides, e.g. <tt>/idx/c/coll/coll.dd</tt>. The script will perform XPAT queries and parse the '''ENTITY''' attribute value from the &lt;FIGURE&gt; tag and the '''SEQ''' attribute of the containing &lt;PB&gt; tag and the <tt>idno</tt> of the containing item. The result is inserted into the table. Deletion is accomplished by deleting all rows with <tt>idno</tt> values for the collection in question.
+==[[Creating Text Class Wordwheels]]==
-=== Putting it all together ===
+==[[Newspaper Clip Image Access Mechanisms]]==
-When Pageviewer displays the image for a given page ('''seq''') for a given item ('''idno''') in some collection it uses these keys to look up the corresponding '''ENTITY''' value in the PageviewFig table. It constructs two IDs from the '''ENTITY''' for lookup in the <tt>idresolver</tt> table. It then builds links using the URLs to ImageClass that come back in the query result. If the thumbnail image URL is empty it builds a text link instead of a thumb link.
+==[[Workflows for delivering PDF files as page images through pageturner]]==
-==[[Working with the User Interface]]==
+==[[Picklist customization for serial articles]]==
-===[[Specifying Individual Collection Characteristics]]===
+<span class="release_15">'''Release_15'''</span>
-Could we re-label this topic with a friendlier title? Perhaps "Customizing Text Class UI Layout"or something.
-===[[Text Class XML Template Files]]===
-===[[Text Class Processing Instructions]]===
-===[[Text Class Graphics Files]]===
 ==[[Text Class Troubleshooting]]==
+This section should include Text Class problems and solutions--user-contributed material is welcome!
 [[#top|Top]]
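
The text removed in this revision describes how the attributes of the PB (page break) element map onto tab-delimited page image metadata (file name, sequence, page number, OCR confidence, feature code). As a rough illustration only, that mapping might be sketched in Python as follows. This is not the DLXS makepageviewdata.pl script; the function names `pb_attributes` and `pageview_row` are hypothetical, and the zero-padding of numeric page numbers and the truncation of feature codes to three letters are assumptions inferred from the sample pageview.dat output quoted above.

```python
import re

# Matches one PB (page break) element and captures its attribute string.
PB_TAG = re.compile(r"<PB\b([^>]*?)/?>")
# Matches NAME="value" pairs inside the captured attribute string.
ATTR = re.compile(r'(\w+)="([^"]*)"')


def pb_attributes(xml_text):
    """Return one dict of attributes per PB element, in document order."""
    return [dict(ATTR.findall(m.group(1))) for m in PB_TAG.finditer(xml_text)]


def pageview_row(attrs):
    """Format one PB element as a tab-delimited row: file, seq, page, confidence, feature."""
    pagenum = attrs.get("N", "")
    if pagenum.isdigit():
        # Assumption: numeric page numbers are zero-padded to eight digits.
        pagenum = pagenum.zfill(8)
    # Assumption: feature codes are cut to three letters, defaulting to UNS.
    feature = attrs.get("FTR", "UNS")[:3]
    return "\t".join(
        [attrs["REF"], attrs["SEQ"], pagenum, attrs.get("CNF", ""), feature]
    )


sample = (
    '<PB REF="00000001.tif" SEQ="00000001" RES="600dpi" FMT="TIFF6.0" '
    'FTR="TPG" CNF="852" N="1"/>\n'
    '<PB REF="00000002.tif" SEQ="00000002" RES="600dpi" FMT="TIFF6.0" '
    'FTR="UNSPEC" CNF="100" N="2"/>'
)

rows = [pageview_row(a) for a in pb_attributes(sample)]
for row in rows:
    print(row)
```

A regular expression is used here only to keep the sketch short; for real collections an XML parser would be more robust against unusual attribute quoting. Rows in this shape could then be loaded into the Pageview table by whatever local mechanism a site prefers.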