OAI Transform

From DLXS Documentation

Current revision: 12:59, 12 August 2008

Overview

This document describes how to transform and normalize harvested records into BibClass. It also describes conditioning that takes place to transcode from Latin-1 to UTF-8, and to handle ISO Character Entity Reference Mapping, Numeric Character Reference Mapping, and miscellaneous character problems.

Transform engine (OAITransform)

OAITransform creates a concatenated BibClass file of all oai_dc records, per repository harvested. To start the transform tool, run ./oaitransform/OAITransform [repository_id] from within $DLXSROOT/bin/o/oaister/oaitransform/.

Replace [repository_id] with the id of the repository you want to transform. This id is taken from repository_table.txt, which you will build using repository_table.sample.txt as your starting point, e.g.,

	./oaitransform/OAITransform celebration

The transform program will process your oai_dc harvested files, first by concatenating them into raw files and then by transforming them into BibClass files. The $DLXSROOT/bin/o/oaister/oaitransform/oai-BibClass3.xsl file is used to perform the mapping from oai_dc to BibClass.

The repository report at the end of the transform will provide a number of statistics.

Repository Report: bristol
        records with URLs       = 2582
        records without URLs    = 0
        skipped records         = 0
        deleted records         = 83
        repository records      = 2665
        success rate            = 96.88%
        ------------------------
        data conditioning msgs? = YES!
        normalization errors    = 2
        raw parse failures      = 0
        valid XML result        = Yes
        HTML removed            = Yes
  • records with URLs: OAIster is only interested in oai_dc records with a dc:identifier beginning with http or ftp, so the transform engine transforms only the records that have such a dc:identifier.
  • records without URLs: The remainder of the records.
  • skipped records: Records skipped through the use of $DLXSROOT/bin/o/oaister/oaitransform/OAITransFixer.pm, which contains Perl subroutines designed to filter out or fix records from particular repositories.
  • deleted records: Number of persistent deleted records in the repository.
  • repository records: All the oai_dc records harvested.
  • success rate: Percentage of the repository records that were transformed (in the sample report above, 2582 of 2665).
  • data conditioning msgs: If there are character errors during transformation, OAITransform fixes them. To see the character errors that were fixed during a transformation, see $DLXSROOT/bin/o/oaister/errors/utf8_status_log.txt.
  • normalization errors: The transform tool uses the $DLXSROOT/bin/o/oaister/oaitransform/normal_types.txt file to normalize the dc:type field values into five distinct BibClass TYPE values, i.e., text, image, audio, video, dataset. If a dc:type value cannot be normalized because no mapping exists for it, it is logged to an error file located at $DLXSROOT/bin/o/oaister/errors/normalization_errors.txt.
  • raw parse failures: If there are encoding errors which the transform tool cannot fix, these will be indicated during the transform.
  • valid XML result: BibClass is now validated during transform.
  • HTML removed: Embedded HTML is now removed from K, AA, SU and X fields. To run transform without this fix, use the -s flag.
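The "records with URLs" rule above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the OAITransform code (which is Perl). The function names, the input shape (one list of dc:identifier strings per record), and the stripping of surrounding whitespace are all assumptions.

```python
# Illustrative sketch only; names and input shape are hypothetical.

def has_url_identifier(identifiers):
    """Return True if any dc:identifier value begins with http or ftp."""
    # Assumption: surrounding whitespace is ignored before the prefix test.
    return any(value.strip().startswith(("http", "ftp")) for value in identifiers)


def summarize(records):
    """Count records with and without a URL-style dc:identifier."""
    with_urls = sum(1 for ids in records if has_url_identifier(ids))
    return {"with_urls": with_urls, "without_urls": len(records) - with_urls}


# Each inner list holds the dc:identifier values of one harvested record.
sample = [
    ["http://example.org/item/1"],
    ["urn:isbn:0000000000", "ftp://example.org/file.pdf"],
    ["oai:example.org:3"],
]
counts = summarize(sample)
print(counts)
```

In this sample, the third record has no identifier beginning with http or ftp, so it would be counted among the records without URLs.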

For questions on how to transform MODS records, please contact Kat Hagedorn at khage at umich dot edu.

Data Conditioning

Transcoding from Latin-1 to UTF-8

If the character encoding of the harvested data does not pass the Perl Encode::is_utf8 test, the data is assumed to be Latin-1 encoded and is transcoded from Latin-1 (iso-8859-1) to UTF-8. This heuristic is useful in the vast majority of cases. Note, however, that it is impossible to determine the encoding with certainty when it is not specified or when the repository has not adhered to the OAI specifications.
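The fallback heuristic can be sketched as follows. This is a Python illustration, not the Perl code used by OAITransform: a strict UTF-8 decode stands in for the Encode::is_utf8 test, and the function name to_utf8 is hypothetical.

```python
# Illustrative sketch only; a strict decode stands in for the Perl test.

def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode from Latin-1."""
    try:
        raw.decode("utf-8")          # raises if the bytes are not valid UTF-8
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as Latin-1.
        return raw.decode("iso-8859-1").encode("utf-8")


already_utf8 = "café".encode("utf-8")
latin1_bytes = b"caf\xe9"            # "café" encoded as Latin-1

print(to_utf8(already_utf8) == already_utf8)
print(to_utf8(latin1_bytes).decode("utf-8"))
```

Every byte sequence is decodable as Latin-1, so the fallback never fails; that is also why the heuristic cannot detect data that was really in some third encoding.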

ISO Character Entity Reference (CER) Mapping

With the exception of the five reserved XML CERs (lt, gt, apos, quot, amp), the ISO Character Entity References from ISOAMSA, ISOAMSB, ISOAMSC, ISOAMSN, ISOAMSO, ISOAMSR, ISOCYR1, ISOCYR2, ISOGRK1, ISOGRK2, ISOGRK3, ISOGRK4, ISOLAT1, ISOLAT2, ISOMFRK, ISONUM, ISOPUB, ISOTECH, MMLALIAS, and MMLEXTRA are translated into their corresponding UTF-8 encoded Unicode characters. It is usually an error for CERs from these sets to appear in the XML because they require an internal subset declaration to make the XML valid. Translating them also improves searchability of the records and decreases file size.
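A minimal sketch of this kind of mapping, in Python for illustration only. It uses the standard library's HTML 4 entity table (html.entities.name2codepoint) as a stand-in for the ISO entity sets listed above; that table covers only part of them. The function name map_cers is hypothetical, and leaving unknown names untouched is an assumption about behavior, not a description of the tool.

```python
import re
from html.entities import name2codepoint  # HTML 4 names; a stand-in table

# The five reserved XML CERs are never translated.
RESERVED = {"lt", "gt", "apos", "quot", "amp"}
CER = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def map_cers(text: str) -> str:
    """Replace known named entity references with the characters they name."""
    def repl(match):
        name = match.group(1)
        if name in RESERVED or name not in name2codepoint:
            # Assumption: reserved and unknown names are left as they are.
            return match.group(0)
        return chr(name2codepoint[name])
    return CER.sub(repl, text)


print(map_cers("caf&eacute; &amp; th&eacute; &lt;b&gt; &alpha;"))
```

The result is a Python string; writing it out with UTF-8 encoding gives the UTF-8 encoded characters described above.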

Numeric Character Reference (NCR) Mapping

Numeric Character References of the form &#xXXXX;, where XXXX represents one to four hexadecimal digits, and &#YYYY;, where YYYY represents one to four decimal digits, are mapped to their equivalent UTF-8 encoded Unicode characters. This is not strictly necessary but improves searchability. The five reserved XML characters, if represented by NCRs, are not mapped.
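The NCR rule can be sketched along the same lines. Again this is an illustrative Python sketch rather than the tool's Perl code; the function name map_ncrs is hypothetical, and skipping code point zero and surrogate code points is an added assumption.

```python
import re

# One to four hex digits after "&#x", or one to four decimal digits after "&#".
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]{1,4})|([0-9]{1,4}));")
RESERVED_CHARS = set("<>&'\"")


def map_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they denote."""
    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Assumption: zero and surrogates are left alone (not legal XML characters).
        if codepoint == 0 or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        char = chr(codepoint)
        # NCRs for the five reserved XML characters are not mapped.
        return match.group(0) if char in RESERVED_CHARS else char
    return NCR.sub(repl, text)


print(map_ncrs("caf&#xE9; caf&#233; &#38; &#x3C;"))
```

References with more than four digits do not match the pattern and pass through unchanged in this sketch.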

Miscellaneous Character Problems

  • Some URLs within dc:identifier erroneously have naked &-separated parameters. Naked '&' characters are converted to '&amp;' in this event but the reserved XML character entity references are protected. '&' characters erroneously followed by whitespace are converted to '&amp;'.
  • Truncated decimal and hexadecimal NCRs (e.g. &#, &#x, &#x2C, etc.) are replaced with the characters following the # or #x portion of the truncated entity (mainly so that a human editor can locate the string). An entity truncated at the '&' and not followed by whitespace may look like a (most likely invalid) CER if a ';' follows within reasonable proximity to the '&'. These are trapped as invalid by the CER mapping.
  • UTF-8 encoded Unicode characters with codepoints higher than U+FFFF are converted to the canonical name string representation of the character as defined in UnicodeData.txt. Since NCRs above U+FFFF have previously been converted to UTF-8 encoded characters, this trap handles all encoded forms of non-Plane Zero Unicode characters. This is done because XPAT does not support the indexing of codepoints higher than U+FFFF.
  • A variety of Windows-1252 characters are converted to their UTF-8 encoded equivalent. Windows-1252 uses the control character range above 0x7f to encode characters like smart quotes. These often inadvertently become part of OAI data sourced from Windows applications.
     "Legal characters are tab, carriage return, line feed, and the legal characters of Unicode 
     and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were 
     current at the time this document was prepared. New characters may be added to these standards 
     by amendments or new editions. Consequently, XML processors must accept any character in the 
     range specified for Char. The use of "compatibility characters", as defined in section 6.8 
     of Unicode (see also D21 in section 3.6 of Unicode3), is discouraged. Character Range Char
     ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."
  • For this purpose we use a modified version of the utf8conditioner written by Simeon Warner at Cornell University (simeon at cs dot cornell dot edu).
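Two of the fixes in the list above, escaping naked '&' characters and replacing characters above U+FFFF with their Unicode names, can be sketched as follows. This is a Python illustration only and is unrelated to the utf8conditioner code. The function names are hypothetical; the sketch protects every well-formed named or numeric reference rather than only the five reserved ones, and the exact output format for character names is an assumption.

```python
import re
import unicodedata

# An '&' is treated as naked unless it starts a well-formed reference
# such as &amp; or &#233; or &#xE9; (assumption about what to protect).
NAKED_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_naked_amps(text: str) -> str:
    """Convert naked '&' characters to '&amp;'."""
    return NAKED_AMP.sub("&amp;", text)


def name_non_bmp(text: str) -> str:
    """Replace characters above U+FFFF with their Unicode character names."""
    return "".join(
        # Fallback for unnamed characters: a U+XXXX label (assumption).
        unicodedata.name(ch, "U+%04X" % ord(ch)) if ord(ch) > 0xFFFF else ch
        for ch in text
    )


print(escape_naked_amps("http://example.org/view?id=1&page=2"))
print(escape_naked_amps("Tom & Jerry &amp; co"))
print(name_non_bmp("clef: \U0001D11E"))
```

Python's unicodedata module draws its names from the Unicode Character Database, the same source as UnicodeData.txt, although the database version depends on the Python release.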
