Data Conversion and Preparation

From DLXS Documentation

(Difference between revisions)

Revision as of 16:55, 18 July 2007

Data Conversion and Preparation

For many collections, converting and preparing data is the most time-consuming and difficult part of mounting the collection online. Because each conversion project is specific to your material and cannot be easily generalized, DLXS does not formally support mechanisms for converting data to various formats. Nevertheless, we do provide some documentation on strategies, tools, and methods that we have found helpful for data conversion. Some of this documentation is class-specific, and some deals with more general Unicode and XML issues.

For Image Class and Bib Class, the class-specific links below provide fairly straightforward strategies and some tools for converting the fielded data typical in those classes.

In some cases, however, you must further modify your data to work effectively with DLXS, for example by inserting “nodes” in Text Class documents. (link) For Text Class, DLPS does not have any preferred methods or quick and easy tools for the conversion process. Only you, after looking at your texts and your encoding practices, can do the intellectual work required to convert the texts to support the necessary Text Class data structure. (link?) You should do this with the tools you are most comfortable using, whether they are macros in your favorite editor, Perl scripts if you have strong programming skills, OmniMark, or XSLT if your source files are, or can be converted to, XML. The Text Class documentation linked below describes a fairly detailed XSLT strategy, using freely available or ubiquitous tools.

compare to & work in regular doc sections

For non-unicode specific information on data preparation for individual classes, see the following: [also edit in Working with Unicode]

Preparing Text Class Data for Index Building / Converting Collections to Text Class
Image Class—where is this section? / Image Class Data Loading: My SQL
Transforming Bibliographic Class Files [transforming?]
Preparing Findaid Class Data for Index Building

For general encoding information, see the following:

Unicode, XML, and Normalization

To make the most of Text Class and Finding Aids Class in DLXS Release 12a (& 13?), you will want to convert or otherwise handle the character entities, numeric entities, or Latin 1 8-bit characters that have been the staples of SGML (and XML, despite the default encoding of UTF-8) for so long. This is separate from the conversion of Text Class materials to the Text Class DTD; even with finding aids that are already in XML, you will probably need to do some testing of character encodings, conversion of these encodings to UTF-8, normalization, and conversion of SGML to XML (strange but true). This topic contains the following sections:

Determining the Character Encodings Present in Your Data
Converting to UTF-8

Determining the Character Encodings Present in Your Data

There are a number of possibilities you may encounter:

  1. Plain ASCII (aka the Basic Latin block)
  2. Character entity references (ISO and otherwise)
  3. Numeric character references (decimal and/or hexadecimal)
  4. Latin 1 characters
  5. UTF-8 characters
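
As a small illustration of the difference between items 4 and 5, the sketch below prints the bytes behind the same character, a-circumflex, in Latin 1 and in UTF-8. It assumes a POSIX shell with od and iconv available; the helper name hex is ours for illustration and is not part of DLXS.

```shell
#!/bin/sh
# hex (hypothetical helper): print the bytes read from stdin as lowercase hex, no spaces.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# \342 is octal for the byte 0xE2, which is a-circumflex in Latin 1.
latin1=$(printf '\342' | hex)
# The same character after conversion to UTF-8 becomes two bytes.
utf8=$(printf '\342' | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "Latin 1: $latin1"
echo "UTF-8:   $utf8"
```

The Latin 1 form is the single byte e2, while the UTF-8 form is the two bytes c3 a2, which is why a file that mixes the two cannot be read cleanly under either encoding.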

You may very well find a mixture of 1, 2, 3, and 4 or even 2, 3, and 5 in the wild, simply because many encoders are not clear on what they should be doing with special characters. One hopes you will not encounter a document with a mixture of Latin 1 and UTF-8 characters, although it's possible that misidentified files could end up concatenated together and create such a mess. There are a number of tools you can use to identify what you have before you:

findentities.pl: a Perl script written by Phil that is part of the DLXS package. It prints the names and frequencies of the entities (CERs and NCRs) it encounters. It is fairly quick, regardless of the size of the file, and can be run on more than one file at once, which is handy if you have a batch of texts.

xpatutf8check: another Perl script written by Phil. It exists to answer the question, "Will xpatu index this?" When a file fails, it reports the line number of the first non-UTF-8 character it encounters. It runs very quickly, so it's great as a first step in checking your material, but it is not authoritative enough to identify all of the problems you may have.

jhove: the JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. It can be slow checking large documents, but it is very informative. Available at http://hul.harvard.edu/jhove/ and invoked with:

jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml

utf8chars: yet another Perl script written by Phil. It identifies the characters used in a document and the Unicode blocks to which they belong. It assumes your document is UTF-8 and reports each instance (by line number) where a non-UTF-8 character is encountered. Because it identifies and counts each character in a document, it is rather slow, but very useful. It runs on one file at a time and prints to standard output, but can be invoked through a foreach to check many files in one command.
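
Where the DLXS scripts are not at hand, iconv alone can give a rough first answer to the same question. The following is a minimal sketch, not a replacement for xpatutf8check, jhove, or utf8chars: the function name classify_encoding is hypothetical, and the "not-utf-8" result only says that the file failed UTF-8 validation, not that it is necessarily Latin 1.

```shell
#!/bin/sh
# classify_encoding FILE (hypothetical helper)
# Prints "ascii", "utf-8", or "not-utf-8" for the named file.
classify_encoding() {
    # iconv rejects any byte above 0x7F when told the input is ASCII.
    if iconv -f ASCII -t ASCII "$1" >/dev/null 2>&1; then
        echo "ascii"
    # Converting UTF-8 to UTF-8 succeeds only if the input is valid UTF-8.
    elif iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo "utf-8"
    else
        echo "not-utf-8"
    fi
}

# Report on every file named on the command line, if any.
for f in "$@"; do
    printf '%s: %s\n' "$f" "$(classify_encoding "$f")"
done
```

Plain ASCII is tested first because an ASCII file is also valid UTF-8; checking in the other order would never report "ascii".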

Converting Those Character Encodings to UTF-8

If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.

First, if you have Latin 1 characters like â, run iconv, part of the GNU C library, to convert files from one encoding to another.

iconv -f iso88591 -t utf8 oldfile > newfile

Next, convert character entity references like &acirc; using isocer2utf8, a Perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it's been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.

/l1/bin/t/text/isocer2utf8 oldfile > newfile

Finally, if you have numeric character references like &#226; or &#xE2;, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters.

/l1/bin/t/text/ncr2utf8 oldfile > newfile

This would be a good point to run findentities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.
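
The three steps can be chained in one function so that the order is never applied out of sequence. This is only a sketch under stated assumptions: the function name to_utf8 and the two override variables are ours, the script paths are the ones shown above, the scripts are assumed to take a filename and write to standard output as in those commands, and any input that fails UTF-8 validation is assumed to be Latin 1.

```shell
#!/bin/sh
# Script locations as shown above; override these for your own installation.
ISOCER2UTF8=${ISOCER2UTF8:-/l1/bin/t/text/isocer2utf8}
NCR2UTF8=${NCR2UTF8:-/l1/bin/t/text/ncr2utf8}

# to_utf8 INFILE OUTFILE (hypothetical helper)
# Runs the conversions in the order described: bytes first, then CERs, then NCRs.
to_utf8() {
    work=$(mktemp -d) || return 1
    {
        # Step 1: transcode only when the input is not already valid UTF-8.
        # Running iconv from Latin 1 on a UTF-8 file would double-encode it.
        if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
            cp "$1" "$work/step1"
        else
            iconv -f ISO-8859-1 -t UTF-8 "$1" > "$work/step1"
        fi
    } &&
    # Step 2: character entity references.
    "$ISOCER2UTF8" "$work/step1" > "$work/step2" &&
    # Step 3: numeric character references.
    "$NCR2UTF8" "$work/step2" > "$work/step3" &&
    cp "$work/step3" "$2"
    status=$?
    rm -rf "$work"
    return $status
}
```

A call such as to_utf8 oldfile newfile leaves oldfile untouched, so the result can be checked before anything is overwritten.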

Test Driving the Tools

Chris: what part of this do you think is helpful to publish, given the audience of people who did not attend the workshop? We could save as is with an explanation, or give different/more general explanations.

In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8. Copy these to your own directory -- they are completely expendable and won't serve a purpose in tomorrow's Text Class implementation. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them.

First, we'll look at which character or numeric entities, if any, are used in these documents.

foreach file (findaid*)
    echo $file
    $DLXSROOT/bin/t/text/findEntities.pl $file
end

foreach file (text*)
    echo $file
    $DLXSROOT/bin/t/text/findEntities.pl $file
end

Since most of you are set up for bash, here are the same commands in that shell:

for file in findaid*
do
    echo $file
    $DLXSROOT/bin/t/text/findEntities.pl $file
done

for file in text*
do
    echo $file
    $DLXSROOT/bin/t/text/findEntities.pl $file
done

We have some CERs and NCRs to deal with, aside from the five XML-approved entities (&amp;, &gt;, &lt;, &apos;, and &quot;). So, we know we'll be needing both isocer2utf8 and ncr2utf8.

Next, we'll see what characters we have (Latin 1? UTF-8? something else?). We'll run through all three tools, just for the sake of completeness, in the order of speediness and terseness.

foreach file (findaid*)
    echo $file
    xpatutf8check $file
end

foreach file (text*)
    echo $file
    xpatutf8check $file
end

Since most of you are set up for bash, here are the same commands in that shell:

for file in findaid*
do
    echo $file
    xpatutf8check $file
done

for file in text*
do
    echo $file
    xpatutf8check $file
done

We now know that both the text files are either UTF-8 or plain ASCII (because of the output of these two tests), but there's a problem with one of the finding aids. jhove will tell us a bit more about our materials. You'll note we don't need to echo the filename as that's part of the jhove report. You'll also notice jhove is not so fast.

foreach file (findaid*)
    jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
end

foreach file (text*)
    jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
end

If you are a bash aficionado, here are the same commands in that shell:

for file in findaid*
do
    jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
done

for file in text*
do
    jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul $file
done

So, the second file in each set is plain ASCII (the Basic Latin block) with entities, the first finding aid is not UTF-8, and the first text file is. Let's look a bit more at the two non-ASCII files with the slowest and most verbose tool of them all. We're not doing a foreach this time, but we wouldn't need to echo the filename either, as it is again part of what the tool reports.

utf8chars findaid1.xml
utf8chars text1.xml

We can see the exact problem with findaid1.xml -- there's an 8-bit Latin 1 e acute before Boekeloo on line 37. We also can see all the UTF-8 characters in text1.xml -- this is the sort of information that is useful when the time comes to map characters and encodings in the xpatu data dictionary.

Now that we know which items need what character treatments, we'll convert them. text1.xml is completely fine, so we'll leave it as is. findaid1.xml has the one Latin 1 character, so we'll use iconv to convert it to UTF-8. It had no entities of any kind, so we'll be done with it after this step.

iconv -f iso88591 -t utf8 findaid1.xml > findaid1.xml.utf

Next, findaid2.xml had numeric character references. It is fine and can be indexed as-is, but users would need to search for the hexadecimal string in the midst of words (&#xE9; for é, for example). So, we'll use ncr2utf8 to convert the entities into the characters. WARNING! &#x26; is the ampersand (as is &#38;) -- if you convert these to the character, you will run into validation problems down the road, as bare ampersands are not permitted in XML. Don't get carried away!

ncr2utf8 findaid2.xml > findaid2.xml.utf

Finally, text2.sgm has ISO character entity references (from Latin 1, Greek, and Publishing) that need to be converted to UTF-8 with isocer2utf8.

isocer2utf8 text2.sgm > text2.sgm.utf

Note that the ampersand CER was not processed. This is perfectly correct.
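
To guard against the bare-ampersand problem mentioned in the warning, a converted file can be scanned for any & that is not the start of a well-formed reference. The sketch below is an approximation under our own assumptions: the function name find_bare_ampersands is hypothetical, the pattern only approximates XML reference syntax, and it knows nothing about CDATA sections or comments, where a bare ampersand is legal.

```shell
#!/bin/sh
# find_bare_ampersands FILE (hypothetical helper)
# Strips every well-formed reference (&name; &#123; &#x1F;) from each line,
# then prints, with line numbers, the lines that still contain an ampersand.
# Exit status is 0 if any bare ampersand was found, non-zero otherwise.
find_bare_ampersands() {
    sed -E 's/&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z_:][A-Za-z0-9_.:-]*);//g' "$1" |
        grep -n '&'
}
```

Running find_bare_ampersands findaid2.xml.utf after the conversion should print nothing if the ampersand references were left alone. Note that the lines it prints have had their valid references removed, so use the line numbers to find the spot in the real file.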

@@ Line 24: / Line 24: @@
 To make the most of Text Class and Finding Aids Class in DLXS Release 12a (& 13?), you will want to convert or otherwise handle the character entities, numeric entities, or Latin 1 8-bit characters that have been the staples of SGML (and XML, despite the default encoding of UTF-8) for so long. This is separate from the conversion of Text Class materials to the Text Class DTD; even with finding aids that are already in XML, you will probably need to do some testing of character encodings, conversion of these encodings to UTF-8, normalization, and conversion of SGML to XML (strange but true).
 This topic contains the following sections:
-•	Determining the Character Encodings Present in Your Data
-•	Converting to UTF-8
+* Determining the Character Encodings Present in Your Data
-Determining the Character Encodings Present in Your Data
+* Converting to UTF-8
+===Determining the Character Encodings Present in Your Data===
 There are a number of possibilities you may encounter:
 . Plain ASCII (aka the Basic Latin block)
@@ Line 35: / Line 38: @@
 You may very well find a mixture of 1, 2, 3, and 4 or even 2, 3, and 5 in the wild, simply because many encoders are not clear on what they should be doing with special characters. One hopes you will not encounter a document with a mixture of Latin 1 and UTF-8 characters, although it's possible that misidentified files could end up concatenated together and create such a mess.
 There are a number of tools you can use to identify what you have before you:
-findentities.pl
+'''findentities.pl'''
 A perl script written by Phil that is part of the DLXS package, it prints the names and frequencies of the entities (CERs and NCRs) it encounters. Fairly quick, regardless of the size of the file. Can be run on more than one file at once, which is handy if you have a batch of texts.
-xpatutf8check
+'''xpatutf8check'''
 Another perl script written by Phil, it exists to answer the question, "Will xpatu index this?" It will report the line number of the first non-UTF character it encounters when it has failed and it runs very quickly, so it's great as a first step in checking your material, but it is not authoritative enough to identify all of the problems you may have.
-jhove
+'''jhove'''
 The JSTOR/Harvard Object Validation Environment has a UTF-8 module that reports whether your document is or is not valid UTF-8, and which Unicode blocks are contained in the document. Can be slow checking large documents, but very informative. Available at http://hul.harvard.edu/jhove/ and invoked with
 jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul file.xml
-utf8chars
+'''utf8chars'''
 Yet another perl script written by Phil, it identifies the characters used in a document and the Unicode blocks to which they belong. It assumes your document is UTF-8 and will report each instance (by line number) where a non-UTF character is encountered. Because it is identifying and counting each character in a document, it is rather slow, but very very useful. Runs on one file at a time and prints to standard out, but can be invoked through a foreach to check many files in one command.
-Converting Those Character Encodings to UTF-8
+===Converting Those Character Encodings to UTF-8===
 If you have a mixed bag of encodings and entities in your documents, there's a definite order in which you want to approach the conversion task, to avoid having a mixture of Latin 1 and UTF-8 in one document at any point in the transformation.
-.	First, if you have Latin 1 characters like â, run iconv, part of the Gnu C library, to convert files from one encoding to another.
-	iconv -f iso88591 -t utf8 oldfile > newfile
-.	Next, convert character entity references like &acirc; using isocer2utf8, a perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it's been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.
+# First, if you have Latin 1 characters like â, run iconv, part of the Gnu C library, to convert files from one encoding to another.
-	/l1/bin/t/text/isocer2utf8 oldfile > newfile
+	<code>iconv -f iso88591 -t utf8 oldfile > newfile</code>
-.	Finally, if you have numeric character references like &#226; or &#xE2;, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters.
+# Next, convert character entity references like &acirc; using isocer2utf8, a perl script written by Phil to convert character entity references to UTF-8 characters. Although it references ISO in the name, it's been expanded to handle all the CERs we've encountered, including TEI Greek and the Chadwyck-Healey custom entities.
-	/l1/bin/t/text/ncr2utf8 oldfile > newfile
+	<code>/l1/bin/t/text/isocer2utf8 oldfile > newfile</code>
+#Finally, if you have numeric character references like &#226; or &#xE2;, run ncr2utf8, also written by Phil, to convert decimal and hexadecimal entities to UTF-8 characters.
+	<code>/l1/bin/t/text/ncr2utf8 oldfile > newfile</code>
 This would be a good point to run findentities.pl again to see what (if anything) you have left, and to re-validate using jhove or utf8chars to ensure that you have done no harm.
-Test Driving the Tools
+==Test Driving the Tools==
 Chris: what part of this do you think is helpful to publish, given the audience of people who did not attend the workshop? We could save as is with an explanation, or give different/more general explanations.
 In the directory /l1/workshop-samples/sooty, you will find four sample files that we'll examine for character encoding and then convert to UTF-8. Copy these to your own directory -- they are completely expendable and won't serve a purpose in tomorrow's Text Class implementation. They are merely illustrative of all the possibilities you might encounter and how you may want to handle them.
 First, we'll look at which character or numeric entities, if any, are used in these documents.