OAI Transform

From DLXS Documentation

Revision as of 15:39, 4 December 2007 by Khage

1 Overview
2 Transform engine (OAITransform)
3 Data Conditioning

Overview

This document describes how to transform and normalize harvested records into BibClass. It also describes conditioning that takes place to transcode from Latin-1 to UTF-8, and to handle ISO Character Entity Reference Mapping, Numeric Character Reference Mapping, and miscellaneous character problems.

Transform engine (OAITransform)

OAITransform creates a concatenated BibClass file of all oai_dc records for each repository harvested. To start the transform tool, run ./oaitransform/OAITransform [repository_id] from within $DLXSROOT/bin/o/oaister/oaitransform/

Replace [repository_id] with the id of the repository you want to transform. This id is taken from repository_table.txt, which you build using repository_table.sample.txt as your starting point, e.g.,

	./oaitransform/OAITransform celebration

The transform program will process your oai_dc harvested files, first by concatenating them into raw files and then by transforming them into BibClass files. The $DLXSROOT/bin/o/oaister/oaitransform/oai-BibClass3.xsl file is used to perform the mapping from oai_dc to BibClass.

The repository report at the end of the transform will provide a number of statistics.

Repository Report: bristol
        records with URLs       = 818
        records without URLs    = 5
        repository records      = 823
        success rate            = 99.39%
        ------------------------
        data conditioning msgs? = YES!
        deleted records (.del)  = 0
        normalization errors    = 2
        raw parse failures      = 0
        valid XML result        = Yes
        HTML removed            = Yes

records with URLs: OAIster is only interested in oai_dc records with a dc:identifier beginning with http or ftp, so the transform engine transforms only records that have such a dc:identifier.
records without URLs: The remainder of the records.
repository records: All the oai_dc records harvested.
data conditioning msgs: Character errors found during transformation are fixed by OAITransform. To see which character errors were fixed by a transformation, see $DLXSROOT/bin/o/oaister/errors/utf8_status_log.txt
deleted records (.del): Not used unless you want to re-write the harvester to mark deleted records with a .del extension.
normalization errors: The transform tool uses the $DLXSROOT/bin/o/oaister/oaitransform/normal_types.txt file to normalize the dc:type field values into five distinct BibClass TYPE values: text, image, audio, video and dataset. If a dc:type value can't be normalized because there is no mapping for it, it is logged to the error file at $DLXSROOT/bin/o/oaister/errors/normalization_errors.txt
raw parse failures: Encoding errors that the transform tool cannot fix are reported during the transform.
valid XML result: BibClass is now validated during transform.
HTML removed: Embedded HTML is now removed from K, AA, SU and X fields. To run transform without this fix, use the -s flag.
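
The first four figures of the report can be related with a short sketch. It assumes that the success rate is the share of repository records that have a URL, which is inferred from the sample figures (818 of 823) and not stated by the tool's documentation. The function names are hypothetical and this is Python for illustration, not OAITransform's own code.

```python
def has_url(identifiers):
    """True if any dc:identifier value begins with http or ftp."""
    return any(value.startswith(("http", "ftp")) for value in identifiers)


def success_rate(with_urls, without_urls):
    """Percentage of repository records that carry a URL (assumed definition)."""
    total = with_urls + without_urls  # "repository records" in the report
    if total == 0:
        return 0.0
    return round(100.0 * with_urls / total, 2)


if __name__ == "__main__":
    print(has_url(["oai:example:123", "http://example.org/item/123"]))
    print(success_rate(818, 5))
```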

For questions on how to transform MODS records, please contact Kat Hagedorn at khage at umich dot edu.

Data Conditioning

Transcoding from Latin-1 to UTF-8

If the character encoding of the harvested data does not pass the Perl Encode::is_utf8 test, the data is assumed to be Latin-1 encoded and is transcoded from Latin-1 (iso-8859-1) to UTF-8. This heuristic works in the vast majority of cases. Note, however, that the encoding cannot be determined with certainty when it is not specified or when the data does not adhere to the OAI specifications.
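
The fallback described above can be sketched in Python. This is only an analogy: it swaps Perl's Encode::is_utf8 test for a strict UTF-8 decode attempt, and the function name is hypothetical.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else treat it as Latin-1."""
    try:
        raw.decode("utf-8")  # strict decode: raises on invalid byte sequences
        return raw
    except UnicodeDecodeError:
        # Assumption carried over from the text: anything else is Latin-1.
        return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    print(to_utf8("caf\u00e9".encode("latin-1")))
```

Every byte sequence decodes as Latin-1, so the fallback never fails. That is also why the heuristic can silently mislabel data that is in some third encoding.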

ISO Character Entity Reference (CER) Mapping

With the exception of the five reserved XML CERs (lt, gt, apos, quot, amp), the ISO Character Entity References from ISOAMSA, ISOAMSB, ISOAMSC, ISOAMSN, ISOAMSO, ISOAMSR, ISOCYR1, ISOCYR2, ISOGRK1, ISOGRK2, ISOGRK3, ISOGRK4, ISOLAT1, ISOLAT2, ISOMFRK, ISONUM, ISOPUB, ISOTECH, MMLALIAS, and MMLEXTRA are translated into their corresponding UTF-8 encoded Unicode characters. It is usually an error for CERs from these sets to appear in the XML because they require an internal subset declaration to make the XML valid. Translating them also improves searchability of the records and decreases file size.
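
A minimal sketch of this kind of mapping, in Python. It uses the standard library's HTML5 entity table as a stand-in for the ISO entity sets listed above, so its coverage is an assumption and will not match OAITransform's tables exactly. Leaving unknown names untouched is also a choice made for the sketch.

```python
import re
from html.entities import html5  # keys look like "eacute;"

RESERVED = {"lt", "gt", "apos", "quot", "amp"}
CER = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def map_cers(text):
    """Replace named entity references with characters, keeping the XML five."""
    def replace(match):
        name = match.group(1)
        if name in RESERVED:
            return match.group(0)  # reserved XML CERs stay as they are
        # Unknown names are left untouched in this sketch.
        return html5.get(name + ";", match.group(0))
    return CER.sub(replace, text)


if __name__ == "__main__":
    print(map_cers("caf&eacute; &amp; &alpha;"))
```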

Numeric Character Reference (NCR) Mapping

Numeric Character References of the form &#xXXXX;, where XXXX is a hexadecimal number of one to four digits, and &#YYYY;, where YYYY is a decimal number of one to four digits, are mapped to their equivalent UTF-8 encoded Unicode characters. This is not strictly necessary but improves searchability. The five reserved XML characters, if represented by NCRs, are not mapped.
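
The NCR rule can be illustrated with a short Python sketch. The function name is hypothetical, and skipping surrogate code points is an extra precaution assumed here, not something the text above specifies.

```python
import re

# Code points of the five reserved XML characters: < > & " '
RESERVED_CODEPOINTS = {ord(c) for c in "<>&\"'"}
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]{1,4})|([0-9]{1,4}));")


def map_ncrs(text):
    """Replace one-to-four-digit hex or decimal NCRs with the character."""
    def replace(match):
        if match.group(1) is not None:
            codepoint = int(match.group(1), 16)
        else:
            codepoint = int(match.group(2))
        if codepoint in RESERVED_CODEPOINTS:
            return match.group(0)  # reserved characters stay escaped
        if 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)  # assumption: never emit lone surrogates
        return chr(codepoint)
    return NCR.sub(replace, text)


if __name__ == "__main__":
    print(map_ncrs("&#xE9; &#233; &#38; &#x3C;"))
```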

Miscellaneous Character Problems

Some URLs within dc:identifier erroneously contain naked &-separated parameters. In this case naked '&' characters are converted to '&amp;', while the reserved XML character entity references are protected. '&' characters erroneously followed by whitespace are also converted to '&amp;'.
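
As an illustration, the ampersand repair can be expressed as a single regular expression in Python. The sketch escapes every '&' that does not begin one of the five reserved CERs. It assumes that other CERs and NCRs have already been mapped to characters by the earlier steps, since it would otherwise escape them too. The function name is hypothetical.

```python
import re

# An '&' that is NOT the start of lt; gt; apos; quot; or amp;
NAKED_AMP = re.compile(r"&(?!(?:lt|gt|apos|quot|amp);)")


def escape_naked_ampersands(text):
    """Turn naked '&' into '&amp;' while protecting the reserved CERs."""
    return NAKED_AMP.sub("&amp;", text)


if __name__ == "__main__":
    print(escape_naked_ampersands("http://example.org/view?id=1&page=2"))
```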

Truncated decimal and hexadecimal NCRs (e.g. &#, &#x, &#x2C) are replaced with the characters following the # or #x portion of the truncated entity, mainly so that a human editor can locate the string. Entities truncated at the '&' and not followed by whitespace may look like (most likely invalid) CERs if a ';' follows within reasonable proximity of the '&'. These are trapped as invalid by the CER mapping.
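
One way to picture the truncated-NCR rule is the Python sketch below. It treats an NCR as truncated when it has no digits or no closing ';', and in that case drops only the &# or &#x prefix. This reading of "truncated" and the function name are assumptions, not taken from the tool.

```python
import re

# Prefix, optional digits, optional terminating semicolon.
NCR_LIKE = re.compile(r"&#(?:x([0-9A-Fa-f]*)|([0-9]*))(;)?")


def strip_truncated_ncrs(text):
    """Keep well-formed NCRs and drop the &# or &#x prefix of truncated ones."""
    def replace(match):
        digits = match.group(1) if match.group(1) is not None else match.group(2)
        terminator = match.group(3) or ""
        if digits and terminator:
            return match.group(0)  # complete NCR: leave it alone
        return digits + terminator  # keep only what followed the prefix
    return NCR_LIKE.sub(replace, text)


if __name__ == "__main__":
    print(strip_truncated_ncrs("see &#x2C and &#233; end"))
```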

UTF-8 encoded Unicode characters with codepoints higher than U+FFFF are converted to the canonical name string representation of the character as defined in UnicodeData.txt. Since NCRs above U+FFFF have previously been converted to UTF-8 encoded characters, this trap handles all encoded forms of non-Plane Zero Unicode characters. This is done because XPAT does not support the indexing of codepoints higher than U+FFFF.
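
Python's unicodedata module exposes the names from UnicodeData.txt, so the substitution can be sketched as follows. The text does not say how the name string is delimited, so this sketch emits the bare name. The U+XXXX fallback for unnamed code points and the function name are likewise assumptions.

```python
import unicodedata


def name_non_bmp(text):
    """Replace characters above U+FFFF with their Unicode character name."""
    pieces = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Fallback for code points without a name is an assumption.
            pieces.append(unicodedata.name(ch, "U+%04X" % ord(ch)))
        else:
            pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(name_non_bmp("clef: \U0001D11E"))
```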

A variety of Windows-1252 characters are converted to their UTF-8 encoded equivalent. Windows-1252 uses the control character range above 0x7f to encode characters like smart quotes. These often inadvertently become part of OAI data sourced from Windows applications.
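
A hedged Python sketch of this repair: after Latin-1 decoding, Windows-1252 punctuation shows up as C1 control characters (U+0080 to U+009F), which can be reinterpreted through the cp1252 codec. The exact set of characters OAITransform converts is not listed above, so mapping the whole range is an assumption.

```python
def fix_windows_1252(text):
    """Reinterpret C1 control characters as Windows-1252 characters."""
    pieces = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is undefined in Windows-1252: leave it unchanged
        pieces.append(ch)
    return "".join(pieces)


if __name__ == "__main__":
    print(fix_windows_1252("\x93smart quotes\x94"))
```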

Illegal characters per the XML 1.0 standard are replaced by the '?' character. From http://www.w3.org/TR/2000/REC-xml-20001006 (sec. 2.2, extracted 16 July 2001):

     "Legal characters are tab, carriage return, line feed, and the legal characters of Unicode 
     and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were 
     current at the time this document was prepared. New characters may be added to these standards 
     by amendments or new editions. Consequently, XML processors must accept any character in the 
     range specified for Char. The use of "compatibility characters", as defined in section 6.8 
     of Unicode (see also D21 in section 3.6 of Unicode3), is discouraged."

     Character Range:
     Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
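
The Char production quoted above translates directly into a negated character class. The Python sketch below replaces everything outside that range with '?'. It illustrates the rule only and is not OAITransform's implementation.

```python
import re

# Anything NOT matched by the XML 1.0 Char production.
ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_illegal_xml_chars(text):
    """Replace characters that XML 1.0 does not allow with '?'."""
    return ILLEGAL_XML_CHARS.sub("?", text)


if __name__ == "__main__":
    print(replace_illegal_xml_chars("ok\x00\x0b\tend"))
```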