Data Conversion and Preparation

From DLXS Documentation

(Difference between revisions)
Jump to: navigation, search
(Converting Those Character Encodings to UTF-8)
(Using Unconverted Collections with Release 12a Middleware)
Line 232: Line 232:
* [http://www.jclark.com/sp/ James Clark's Original sp Documentation]
* [http://www.jclark.com/sp/ James Clark's Original sp Documentation]
-
==Using Unconverted Collections with Release 12a Middleware==
+
==Using Unconverted Collections with Current Middleware==
Recognizing that there may be situations when you do not wish to migrate a collection to XML and UTF-8 immediately, there are mechanisms built into Release 12a to allow the middleware to handle SGML-style empty elements (aka singletons), Latin 1 characters, and character entity references. How do you make this happen? In DlpsUtils.pm, there is a subroutine called Sgml2XmlFilter that has a hard-coded list of empty elements (<code><PB>, <LB>, <CAESURA></code>, etc.) that are converted upon discovery to XML-style (<code><PB/>, <LB/>, <CAESURA/></code>, etc.), and a feature that converts Latin 1 (ISO-8859-1) characters to UTF-8. This subroutine comes into play if the locale field in collmgr is not set to en_US.UTF-8 (locale used to be optional but is now required if you are using UTF-8 and xpatu). In order to declare your entities, you need to put a file called entitiesdoctype.chnk in the web directory for your collection, declaring the entities like so:
Recognizing that there may be situations when you do not wish to migrate a collection to XML and UTF-8 immediately, there are mechanisms built into Release 12a to allow the middleware to handle SGML-style empty elements (aka singletons), Latin 1 characters, and character entity references. How do you make this happen? In DlpsUtils.pm, there is a subroutine called Sgml2XmlFilter that has a hard-coded list of empty elements (<code><PB>, <LB>, <CAESURA></code>, etc.) that are converted upon discovery to XML-style (<code><PB/>, <LB/>, <CAESURA/></code>, etc.), and a feature that converts Latin 1 (ISO-8859-1) characters to UTF-8. This subroutine comes into play if the locale field in collmgr is not set to en_US.UTF-8 (locale used to be optional but is now required if you are using UTF-8 and xpatu). In order to declare your entities, you need to put a file called entitiesdoctype.chnk in the web directory for your collection, declaring the entities like so:

Revision as of 15:15, 31 July 2007

Contents

Converting and Preparing Your Data

For many collections, converting and preparing data is the most time-consuming and difficult part of mounting the collection online. Because each conversion project is specific to your material and cannot be easily generalized, DLXS does not formally support mechanisms for converting data to various formats. Nevertheless, we do provide some documentation on strategies, tools, and methods that we have found helpful for data conversion. Some of this documentation is class-specific, and some deals with more general Unicode and XML issues.

For Image Class and Bib Class, the class-specific links below provide fairly straightforward strategies and some tools for converting the fielded data typical in those classes.

In some cases, however, you must further modify your data to work effectively with DLXS. For example, converting to the Text Class DTD and inserting “nodes” in Text Class documents. (link) For Text Class, DLPS does not have any preferred methods or quick and easy tools for the conversion process. Only you, after looking at your texts and your encoding practices, can do the intellectual work required to convert the texts to support the necessary Text Class data structure (link?) You should do this with the tools you are most comfortable using, whether they are macros in your favorite editor, perl scripts in you have strong programming skills, OmniMark, or XSLT if your source files are currently or can be converted to XML. The Text Class documentation linked below describes a fairly detailed XSLT strategy, using freely-available or ubiquitous tools.


For non-unicode specific information on data preparation for individual classes, see the following: [also edit in Working with Unicode]

  • Preparing Text Class Data for Index Building / Converting Collections to Text Class
  • Image Class—where is this section? / Image Class Data Loading: My SQL
  • Transforming Bibliographic Class Files
  • Mounting a Finding Aids Collection: Preparing Data and Directories

For other encoding information, see the following:

Unicode, XML, and Normalization

To make the most of Text Class and Finding Aids Class in DLXS Release 12 and up, you will want to convert or otherwise handle the character entities, numeric entities, or Latin 18-bit characters that have been the staples of SGML (and XML, despite the default encoding of UTF-8) for so long. This is separate from the conversion of Text Class materials to the Text Class DTD; even with finding aids that are already in XML, you will probably need to do some testing of character encodings, conversion of these encodings to UTF-8, normalization, and conversion of SGML to XML (strange but true).

For more Unicode-specific information, see Working with Unicode.


Determining the Character Encodings Present in Your Data

There are a number of possibilities you may encounter:

  1. Plain ASCII (aka the Basic Latin block)
  2. Character entity references (ISO and otherwise)
  3. Numeric character references (decimal and/or hexadecimal)
  4. Latin 1 characters
  5. UTF-8 characters

You may very well find a mixture of 1, 2, 3, and 4 or even 2, 3, and 5 in the wild, simply because many encoders are not clear on what they should be doing with special characters. One hopes you will not encounter a document with a mixture of Latin 1 and UTF-8 characters, although it's possible that misidentified files could end up concatenated together and create such a mess. There are a number of tools you can use to identify what you have before you:

findentities.pl A perl script written by Phil that is part of the DLXS package, it prints the names and frequencies of the entities (CERs and NCRs) it encounters. Fairly quick, regardless of the size of the file. Can be run on more than one file at once, which is handy if you have a batch of texts.

xpatutf8check Another perl script written by Phil, it exists to answer the question, "Will xpatu index this?" It will report the line number of the first non-UTF character it encounters when it has failed and it runs very quickly, so it's great as a first step in checking your material, but it is not authoritative enough to identify all of the problems you may have.

jhove The JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. Can be slow checking large documents, but very informative. Available at http://hul.harvard.edu/jhove/ and invoked with jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml

utf8chars Yet another perl script written by Phil, it identifies the characters used in a document and the Unicode blocks to which they belong. It assumes your document is UTF-8 and will report each instance (by line number) where a non-UTF character is encountered. Because it is identifying and counting each character in a document, it is rather slow, but very very useful. Runs on one file at a time and prints to standard out, but can be invoked through a foreach to check many files in one command.

Converting Those Character Encodings to UTF-8

If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.

  1. First, if you have Latin 1 characters like â, run iconv, part of the Gnu C library, to convert files from one encoding to another.

    iconv -f iso88591 -t utf8 oldfile > newfile

  2. Next, convert character entity references like &acirc; using isocer2utf8, a perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it's been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.

    /l1/bin/t/text/isocer2utf8 oldfile > newfile

  3. Finally, if you have numeric character references like â or &amp#xE2;, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters.

    /l1/bin/t/text/ncr2utf8 oldfile > newfile</p>

This would be a good point to run findentities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.

Workshop Materials: Test Driving the Tools

In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8. Copy these to your own directory -- they are completely expendable and won't serve a purpose in tomorrow's Text Class implementation. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them. First, we'll look at which character or numeric entities, if any, are used in these documents.

   foreach file (findaid*)
   echo $file 
   $DLXSROOT/bin/t/text/findEntities.pl $file 
   end

   foreach file (text*)
   echo $file 
   $DLXSROOT/bin/t/text/findEntities.pl $file 
   end

Since most of you are set up for bash, here are the same commands in that shell:

   for file in findaid*
   do
   echo $file 
   $DLXSROOT/bin/t/text/findEntities.pl $file
   done

   for file in text*
   do
   echo $file 
   $DLXSROOT/bin/t/text/findEntities.pl $file
   done

We have some CERs and NCRs to deal with, aside from the five XML-approved ent</code> and ncr2utf. Next, we'll see what characters we have (Latin 1? UTF-8? something else?). We'll run through all three tools, just for the sake of completeness, in the order of speediness and terseness.

   foreach file (findaid*)
   echo $file 
   xpatutf8check $file 
   end
   foreach file (text*)
   echo $file 
   xpatutf8check  $file 
   end

Since most of you are set up for bash, here are the same commands in that shell:

   for file in findaid*
   do
   echo $file 
   xpatutf8check  $file
   done
   for file in text*
   do
   echo $file 
   xpatutf8check  $file
   done

We now know that both the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. jhove will tell us a bit more about our materials. You'll note we don't need to echo the filename as that's part of the jhove report. You'll also notice jhove is not so fast.

   foreach file (findaid*)
   jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul  $file 
   end
   foreach file (text*)
   jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul  $file 
   end

If you are a bash aficionado, here are the same commands in that shell:

   for file in findaid*
   do
   jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul  $file
   done
   for file in text*
   do
   jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul  $file
   done

So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is. Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all. We're not doing a foreach this time, but we wouldn't need to echo the filename either, as it is again part of what the tool reports.

   utf8chars findaid1.xml
   utf8chars text1.xml

We can see the exact problem with findaid1.xml -- there's an 8-bit Latin 1 e acute before Boekeloo on line 37. We also can see all the UTF-8 characters in text1.xml -- this is the sort of information that is useful when time comes to map characters and encodings in the xpatu data dictionary.

Now that we know which items need what character treatments, we'll convert them. text1.xml is completely fine, so we'll leave it as is. findaid1.xml has the one Latin 1 character, so we'll use iconv to convert it to UTF-8. It had no entities of any kind, so we'll be done with it after this step.

   iconv -f iso88591 -t utf8 findaid1.xml > findaid1.xml.utf

Next, findaid2.xml had numeric character references. It is fine and can be indexed as-is, but users would need to search for the hexadecimal string in the midst of words ( é for é, for example). So, we'll use ncr2utf to convert the entities into the characters. WARNING! & is the ampersand (as is &) -- if you convert these to the character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Don't get carried away!

   ncr2utf8 findaid2.xml > findaid2.xml.utf

Finally, text2.sgm has ISO character entity references (from Latin 1, Greek, and Publishing) that need to be converted to UTF-8 with isocer2utf.

   isocer2utf8 text2.sgm > text2.sgm.utf

Note that the ampersand CER was not processed. This is perfectly correct.

More Documentation

Normalization and Converting SGML to XML

Many of you may be in a position where you'll want to be converting your SGML files to XML. Many of you will be fortunate enough to have files already in XML -- say, finding aids in EAD 2002. However, these will have to be normalized, too, to avoid problems with xpatu and xmlrgn down the road by ensuring that all the attributes are in the same order as specified in the DTD. Because of known but uncorrected problems in the normalization tools, you will end up with SGML and will need to convert that to XML.

Because the file we want to work with is now UTF-8, we need to set some environment variables for the tools from the sp package to let them know this is UTF-8. It doesn't matter that you've set your puTTy window to UTF-8, if you are using osx, osgmlnorm, or onsgmls, you must set your environment properly. For example:

   setenv SP_CHARSET_FIXED YES
   setenv SP_ENCODING utf-8

For those of you in bash, it's

   export SP_CHARSET_FIXED=YES
   export SP_ENCODING=utf-8

First we normalize, invoking a declaration to handle the non-SGML UTF-8 characters without claiming that the material itself is XML.

  osgmlnorm $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp text2.sgm.utf > text2.sgm.norm

Now I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and it's fine. Now to convert our SGML to XML using osx.

   osx -x no-nl-in-tag -x empty -E 500 -f errors $DLXSROOT/misc/sgml/xmlentities.dcl sample.inp          text2.sgm.norm > text2.xml

Again I'll test the output with one of the UTF-8 tools to make sure that it's come through unscathed, and with findEntities.pl to see what has happened with the remaining XML-friendly entities, and again it's fine.

Just for fun, we'll normalize the files already in XML, just to show that things get changes from XML to SGML against their will.

   osgmlnorm $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp   text1.xml > text1.xml.norm
   osx -x no-nl-in-tag -x empty -E 5000 -f error $DLXSROOT/misc/sgml/xml.dcl     $DLXSROOT/prep/s/sampletc_utf8/sampletc_utf8.text.inp  text1.xml.norm > text1.xml.norm.xml

More Documentation

Using Unconverted Collections with Current Middleware

Recognizing that there may be situations when you do not wish to migrate a collection to XML and UTF-8 immediately, there are mechanisms built into Release 12a to allow the middleware to handle SGML-style empty elements (aka singletons), Latin 1 characters, and character entity references. How do you make this happen? In DlpsUtils.pm, there is a subroutine called Sgml2XmlFilter that has a hard-coded list of empty elements (<PB>, <LB>, <CAESURA>, etc.) that are converted upon discovery to XML-style (<PB/>, <LB/>, <CAESURA/>, etc.), and a feature that converts Latin 1 (ISO-8859-1) characters to UTF-8. This subroutine comes into play if the locale field in collmgr is not set to en_US.UTF-8 (locale used to be optional but is now required if you are using UTF-8 and xpatu). In order to declare your entities, you need to put a file called entitiesdoctype.chnk in the web directory for your collection, declaring the entities like so:

   <!DOCTYPE Top [
   <!ENTITY Dstrok   "Đ">
   <!ENTITY Sacute   "Ś">
   <!ENTITY Scaron   "Š">
   <!ENTITY Ubreve   "Ŭ">
   <!ENTITY Zdot     "Ż">
   ]>

That being the case, why would anyone ever bother to go through the trouble of converting their material? First off, the value of having UTF-8 is apparent if you have material that used more than one entity set (and even the lowliest collections have both an e acute and an em-dash in them somewhere). Now that — is one character that can be mapped to a space in the data dictionary like other punctuation, phrases that were obscured in searches now turn up, and characters that we used to flatten (transforming ā to a, for example) can be displayed. Second, this facility comes at a cost. All of the material returned needs to be run through this filter, which will take some time. In a results list, the lag is negligable, but in larger sections of text, it could be noticeable. Finally, some confusion might arise when a user cuts and pastes material he received as a result and cannot retrieve it again, because the results and input are UTF-8 (which is the encoding of the search form) but the material being searched is not. Insert non-formatted text here

Personal tools