Workflows for delivering PDF files as page images through pageturner

DLXS uses the encoding levels of "TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices: Version 1.0", allowing Level-1, Level-2, or Level-4 content. Level-1 and Level-2 are for page-image content.

(Note that the latest stable version of the best practices document is 2.1, which links to the draft of version 3.0, still under development.)

If you're working from PDF source, you'll need to split the PDF into TIFF images. OCR software often requires bitonal TIFF images, so you may decide not to also save selected pages as grayscale or contone, even though doing so would improve the appearance of those pages.

You can split the PDF into TIFFs at the command line using Ghostscript. Because Ghostscript sometimes chokes on PDFs, you might fall back to pdftoppm followed by convert; this option uses more processing time and storage space. If pdftoppm/convert also fails, you can use Automator in Mac OS X to extract the pages and then GraphicConverter (for Mac) to make them bitonal. While Acrobat lets you produce TIFFs from a PDF file, it can't produce bitonal TIFFs.
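
As a rough sketch of this step, here is a minimal Python script that wraps the tools above: it tries Ghostscript's bitonal tiffg4 device first and falls back to pdftoppm plus ImageMagick's convert. The 600 dpi resolution, the file-naming pattern, and the assumption that gs, pdftoppm, and convert are all on your PATH are placeholders to adjust for your own environment.

    #!/usr/bin/env python
    """Split a PDF into bitonal (CCITT Group 4) TIFF page images.

    Tries Ghostscript first and falls back to pdftoppm followed by
    ImageMagick's convert. Resolution and file naming are placeholders.
    """
    import glob
    import os
    import subprocess
    import sys

    def split_with_ghostscript(pdf, outdir, dpi=600):
        """One bitonal TIFF per page via Ghostscript's tiffg4 device."""
        pattern = os.path.join(outdir, "%08d.tif")
        subprocess.check_call([
            "gs", "-dNOPAUSE", "-dBATCH", "-dQUIET",
            "-sDEVICE=tiffg4", "-r%d" % dpi,
            "-sOutputFile=" + pattern, pdf,
        ])

    def split_with_pdftoppm(pdf, outdir, dpi=600):
        """Fallback: pdftoppm writes PBM bitmaps; convert rewrites them as G4 TIFFs."""
        prefix = os.path.join(outdir, "page")
        subprocess.check_call(["pdftoppm", "-mono", "-r", str(dpi), pdf, prefix])
        for pbm in sorted(glob.glob(prefix + "-*.pbm")):
            tif = os.path.splitext(pbm)[0] + ".tif"
            subprocess.check_call(["convert", pbm, "-compress", "Group4", tif])
            os.remove(pbm)

    if __name__ == "__main__":
        pdf, outdir = sys.argv[1], sys.argv[2]
        try:
            split_with_ghostscript(pdf, outdir)
        except subprocess.CalledProcessError:
            split_with_pdftoppm(pdf, outdir)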

You can run these bitonal images through OCR software and insert the required Level-1 markup (mostly PB elements) at the appropriate places based on your OCR output. However, you will also need a way of producing pagetagging for the pageturner. Michigan uses PageTag for this: the OCR operator takes values from the pagetag data, inserts them into the XML output of the OCR process, and produces a pageview.dat file. That is, the OCR process is able to produce all attributes of PB except FTR= and N=, and the values of those attributes come from the pagetagging process. See Working with Page Image Access Mechanisms in Text Class.
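
As an illustration of that merge (not the actual Michigan workflow), here is a hedged Python sketch. It assumes the pagetag data has been exported as a tab-delimited file of SEQ, N, and FTR values (the real PageTag/pageview.dat layout may differ) and that the OCR output contains self-closing PB elements that already carry a SEQ= attribute; it simply copies the matching N= and FTR= values onto each PB.

    #!/usr/bin/env python
    """Merge pagetagging values into the PB elements of OCR output.

    Assumes a tab-delimited export of SEQ, N, FTR and self-closing
    PB elements that already carry a SEQ= attribute.
    """
    import re
    import sys

    def load_pagetags(path):
        """Return {seq: (n, ftr)} from a tab-delimited pagetag export."""
        tags = {}
        with open(path) as fh:
            for line in fh:
                seq, n, ftr = line.rstrip("\n").split("\t")
                tags[seq] = (n, ftr)
        return tags

    def merge(xml_text, tags):
        """Copy N= and FTR= onto each PB, keyed on its SEQ= value."""
        def fix(match):
            pb = match.group(0)
            seq = re.search(r'SEQ="(\d+)"', pb).group(1)
            n, ftr = tags.get(seq, ("", ""))
            extra = ' N="%s"' % n
            if ftr:
                extra += ' FTR="%s"' % ftr
            return pb[:-2].rstrip() + extra + "/>"
        return re.sub(r"<PB\b[^>]*/>", fix, xml_text)

    if __name__ == "__main__":
        tags = load_pagetags(sys.argv[2])
        sys.stdout.write(merge(open(sys.argv[1]).read(), tags))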

However, if your PDFs are vector PDFs, or raster PDFs that have been run through an OCR process that embeds the OCR text in them, you might instead want to generate your XML by extracting text from the PDF file. While there are command-line utilities for extracting the text, Acrobat gives cleaner, more usable output. Although there won't be OCR errors, you will almost certainly have residue of the text extraction process, some of which depends on how the PDF was produced: missing spaces at line breaks, messed-up hyphenation, and -- worst of all -- text from multiple columns running together as if it were meant to be read across the page rather than down a column. So it might not be worth the hassle.
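
Of that residue, only the end-of-line hyphenation is straightforward to repair mechanically. A tiny Python sketch of that one cleanup follows; treat it as a starting point, since it will also rejoin genuinely hyphenated compounds, and it does nothing for missing spaces or run-together columns, which still need proofreading.

    import re

    def rejoin_hyphenation(text):
        """Rejoin words that text extraction split across line breaks.
        Caution: this also joins genuinely hyphenated compounds."""
        return re.sub(r"(\w)-\n(\w)", r"\1\2", text)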

If you do go this route, Acrobat Reader offers text extraction under File > Save as Text..., whereas Acrobat Professional gives the same functionality under File > Save As..., where you choose "Text (Accessible)" as the format. I have scripts, available upon request, that take this output, insert PB tags at the page breaks (marked by a nondisplaying character), and insert values for FTR= and N= from pagetagging data.
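
Those scripts aren't reproduced here, but the following Python sketch shows the general idea under two assumptions not confirmed above: that the nondisplaying page-break marker in Acrobat's text output is a form feed (U+000C), and that the pagetagging values have already been loaded into a dictionary keyed on sequence number, as in the earlier sketch.

    """Turn Acrobat 'Save as Text' output into page-broken Level-1 text.

    Assumes the page-break marker is a form feed (U+000C) and that the
    pagetag values are in a dict like the one built by load_pagetags().
    """
    from xml.sax.saxutils import escape

    def insert_pb(text, tags):
        pages = text.split("\f")          # one chunk of text per PDF page
        out = []
        for seq, page in enumerate(pages, start=1):
            n, ftr = tags.get(str(seq), ("", ""))
            pb = '<PB SEQ="%d" N="%s"' % (seq, n)
            if ftr:
                pb += ' FTR="%s"' % ftr
            out.append(pb + "/>")
            out.append(escape(page))      # protect &, <, > in the raw text
        return "\n".join(out)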

If you're going to use Acrobat Professional's built-in OCR capability (which inserts hidden text behind the raster image), then, having opened the files in Acrobat anyway, you might as well extract the text at that point.

In collections produced this way, we make bitonal page images of every page (even those that would be better done as grayscale or contone), and switching to the text view shows text extracted from the PDF rather than generated by OCR.

If you want to make Level-2 content, you'll need to insert the relevant chunks of markup into the OCR output at the appropriate places. You can use macros in a text editor to paste a series of these chunks into an empty text file, which is then saved with an IDNO corresponding to the OCR'd item. Together these bits of XML can be tested for well-formedness (to catch obvious errors), and if a page number is tied to each chunk, a script can insert them at the appropriate points relative to the PB elements in the corresponding OCR output.
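
As a sketch of that final scripted step (assuming, purely for illustration, a tab-delimited chunk file with the page number in the first column and the chunk of XML in the second), the following Python places each chunk immediately before the PB whose N= value matches.

    #!/usr/bin/env python
    """Insert Level-2 structural chunks into Level-1 OCR output.

    Assumes a tab-delimited file pairing a page number (the PB N= value)
    with the chunk of XML that belongs on that page.
    """
    import re
    import sys

    def load_chunks(path):
        """Return {page_number: [chunk, ...]}."""
        chunks = {}
        with open(path) as fh:
            for line in fh:
                n, xml = line.rstrip("\n").split("\t", 1)
                chunks.setdefault(n, []).append(xml)
        return chunks

    def insert_chunks(xml_text, chunks):
        """Place each chunk immediately before the PB whose N= matches."""
        def fix(match):
            pb = match.group(0)
            n = re.search(r'N="([^"]*)"', pb)
            before = chunks.get(n.group(1), []) if n else []
            return "\n".join(before + [pb])
        return re.sub(r"<PB\b[^>]*/>", fix, xml_text)

    if __name__ == "__main__":
        sys.stdout.write(insert_chunks(open(sys.argv[1]).read(),
                                       load_chunks(sys.argv[2])))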
