OAI Transform
From DLXS Documentation
==Overview==
This document describes how to transform and normalize harvested records into BibClass. It also describes conditioning that takes place to transcode from Latin-1 to UTF-8, and to handle ISO Character Entity Reference Mapping, Numeric Character Reference Mapping, and miscellaneous character problems.
==Transform engine (OAITransform)==
OAITransform creates one concatenated BibClass file of all oai_dc records for each repository harvested. To start the transform tool, run <code>./oaitransform/OAITransform [repository_id]</code> from within <code>$DLXSROOT/bin/o/oaister/oaitransform/</code>.
Supply the id of the repository you want to transform. The id is taken from <code>repository_table.txt</code>, which you build using <code>repository_table.sample.txt</code> as your starting point. For example:
<pre>
./oaitransform/OAITransform celebration
</pre>
The transform program processes your harvested oai_dc files, first by concatenating them into raw files and then by transforming them into BibClass files. The <code>$DLXSROOT/bin/o/oaister/oaitransform/oai-BibClass3.xsl</code> file performs the mapping from oai_dc to BibClass.
The repository report at the end of the transform provides a number of statistics:
<pre>
Repository Report: bristol
records with URLs       = 818
records without URLs    = 5
repository records      = 823
success rate            = 99.39%
------------------------
data conditioning msgs? = YES!
deleted records (.del)  = 0
normalization errors    = 2
raw parse failures      = 0
valid XML result        = Yes
HTML removed            = Yes
</pre>
- records with URLs: OAIster is only interested in oai_dc records with a dc:identifier beginning with http or ftp, so the transform engine transforms only the records that have such a dc:identifier.
- records without URLs: The remainder of the records.
- repository records: All the oai_dc records harvested.
- data conditioning msgs: Character errors encountered during transformation are fixed by OAITransform. To see which character errors were fixed during a transformation, see <code>$DLXSROOT/bin/o/oaister/errors/utf8_status_log.txt</code>.
- deleted records (.del): Not used unless you want to re-write the harvester to mark deleted records with a .del extension.
- normalization errors: The transform tool uses the <code>$DLXSROOT/bin/o/oaister/oaitransform/normal_types.txt</code> file to normalize the dc:type field values into five distinct BibClass TYPE values: text, image, audio, video, and dataset. If a dc:type value cannot be normalized because no mapping exists for it, it is logged to the error file at <code>$DLXSROOT/bin/o/oaister/errors/normalization_errors.txt</code>.
- raw parse failures: If there are encoding errors which the transform tool cannot fix, these will be indicated during the transform.
- valid XML result: BibClass is now validated during transform.
- HTML removed: Embedded HTML is now removed from K, AA, SU and X fields. To run transform without this fix, use the -s flag.
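The report fields above can be illustrated with a short Python sketch. It is not the OAITransform code. The function names, the whitespace stripping, and the contents of <code>NORMAL_TYPES</code> are assumptions made for illustration (the real mappings live in <code>normal_types.txt</code>), and the success rate formula is inferred from the sample report, where 818 of 823 records gives 99.39%.

```python
# Illustrative sketch only: names and mapping values are hypothetical.

def has_url(identifiers):
    """True if any dc:identifier value begins with http or ftp."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return any(i.strip().startswith(("http", "ftp")) for i in identifiers)

def success_rate(records_with_urls, repository_records):
    """Percentage of harvested records that carry a usable URL."""
    if repository_records == 0:
        return 0.0
    return round(100.0 * records_with_urls / repository_records, 2)

# Hypothetical stand-in for the mappings kept in normal_types.txt.
NORMAL_TYPES = {
    "article": "text",
    "photograph": "image",
    "sound recording": "audio",
    "moving image": "video",
    "data set": "dataset",
}

def normalize_type(dc_type, errors):
    """Map a dc:type value to a BibClass TYPE; record unmapped values."""
    key = dc_type.strip().lower()
    if key in NORMAL_TYPES:
        return NORMAL_TYPES[key]
    errors.append(dc_type)  # would be written to normalization_errors.txt
    return None

if __name__ == "__main__":
    print(success_rate(818, 823))
    print(has_url(["oai:example:1", "http://example.org/item/1"]))
```

Unmapped dc:type values are collected rather than raised so that a single unknown value does not stop the run, which matches the report's approach of counting normalization errors.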
For questions on how to transform MODS records, please contact Kat Hagedorn at khage at umich dot edu.
==Data Conditioning==
===Transcoding from Latin-1 to UTF-8===
If the harvested data does not pass the Perl Encode::is_utf8 test, it is assumed to be Latin-1 encoded and is transcoded from Latin-1 (ISO-8859-1) to UTF-8. This heuristic works in the vast majority of cases. Note, however, that the encoding cannot be determined with certainty if it is not specified or if the repository does not adhere to the OAI specifications.
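A minimal Python sketch of this heuristic follows. It is an approximation: it substitutes a strict UTF-8 decode for the Perl Encode::is_utf8 test, and the function name <code>to_utf8</code> is hypothetical.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else treat it as Latin-1."""
    try:
        raw.decode("utf-8")  # strict decode stands in for the Perl test
        return raw
    except UnicodeDecodeError:
        # Every byte value is defined in Latin-1, so this cannot fail,
        # but the result is only correct if the data really was Latin-1.
        return raw.decode("latin-1").encode("utf-8")

if __name__ == "__main__":
    print(to_utf8(b"caf\xe9"))      # Latin-1 input is transcoded
    print(to_utf8(b"caf\xc3\xa9"))  # UTF-8 input passes through
```

Because the Latin-1 fallback accepts any byte sequence, data in a third encoding is silently mis-transcoded, which is the limitation described above.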
===ISO Character Entity Reference (CER) Mapping===
With the exception of the five reserved XML CERs (lt, gt, apos, quot, amp), the ISO Character Entity References from ISOAMSA, ISOAMSB, ISOAMSC, ISOAMSN, ISOAMSO, ISOAMSR, ISOCYR1, ISOCYR2, ISOGRK1, ISOGRK2, ISOGRK3, ISOGRK4, ISOLAT1, ISOLAT2, ISOMFRK, ISONUM, ISOPUB, ISOTECH, MMLALIAS, and MMLEXTRA are translated into their corresponding UTF-8 encoded Unicode characters. It is usually an error for CERs from these sets to appear in the XML because they require an internal subset declaration to make the XML valid. Translating them also improves searchability of the records and decreases file size.
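The mapping can be sketched in Python as follows. This is an illustration under stated assumptions, not the OAITransform implementation: it uses the HTML 4 entity table from the standard library (<code>html.entities.name2codepoint</code>), which covers only part of the ISO sets listed above, and it leaves unrecognized names untouched instead of trapping them as invalid.

```python
import re
from html.entities import name2codepoint

# The five reserved XML CERs must survive unchanged.
RESERVED = {"lt", "gt", "apos", "quot", "amp"}
CER = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def map_cers(text: str) -> str:
    """Replace known, non-reserved entity references with their characters."""
    def replace(match):
        name = match.group(1)
        if name in RESERVED or name not in name2codepoint:
            return match.group(0)  # protected or unknown: leave as-is
        return chr(name2codepoint[name])
    return CER.sub(replace, text)

if __name__ == "__main__":
    print(map_cers("caf&eacute; &amp; &lt;b&gt;"))
```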
===Numeric Character Reference (NCR) Mapping===
Numeric Character References of the form <code>&#xXXXX;</code>, where XXXX is a hexadecimal number of one to four digits, and of the form <code>&#YYYY;</code>, where YYYY is a decimal number of one to four digits, are mapped to their equivalent UTF-8 encoded Unicode characters. This is not strictly necessary but improves searchability. The five reserved XML characters, if represented by NCRs, are not mapped.
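A Python sketch of this mapping, under the digit limits stated above, might look like the following. The function name is hypothetical, and the decision to leave surrogate code points and <code>&#0;</code> untouched is an assumption of the sketch, not documented behavior.

```python
import re

RESERVED_CHARS = set("<>&\"'")
# Hexadecimal or decimal NCRs of one to four digits.
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]{1,4})|([0-9]{1,4}));")

def map_ncrs(text: str) -> str:
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Assumption: skip NUL and lone surrogates, which are not characters.
        if codepoint == 0 or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        char = chr(codepoint)
        if char in RESERVED_CHARS:
            return match.group(0)  # reserved XML characters stay as NCRs
        return char
    return NCR.sub(replace, text)

if __name__ == "__main__":
    print(map_ncrs("&#xE9; &#233; &#38; &#x3C;"))
```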
===Miscellaneous Character Problems===
- Some URLs within dc:identifier erroneously have naked &-separated parameters. In this event, naked '&' characters are converted to '&amp;', but the reserved XML character entity references are protected. '&' characters erroneously followed by whitespace are also converted to '&amp;'.
- Truncated decimal and hexadecimal NCRs (e.g. &#, &#x, etc.) are replaced with the characters following the # or #x portion of the truncated entity, mainly so that a human editor can locate the string. Entities truncated at the '&' and not followed by whitespace may look like (most likely invalid) CERs if a ';' follows within reasonable proximity of the '&'. These are trapped as invalid by the CER mapping.
- UTF-8 encoded Unicode characters with codepoints higher than U+FFFF are converted to the canonical name string representation of the character as defined in UnicodeData.txt. Since NCRs above U+FFFF have previously been converted to UTF-8 encoded characters, this trap handles all encoded forms of non-Plane Zero Unicode characters. This is done because XPAT does not support the indexing of codepoints higher than U+FFFF.
- A variety of Windows-1252 characters are converted to their UTF-8 encoded equivalents. Windows-1252 uses the control character range above 0x7f (0x80 to 0x9f) to encode characters like smart quotes. These often inadvertently become part of OAI data sourced from Windows applications.
- Illegal characters per the XML 1.0 standard are replaced by the '?' character. From http://www.w3.org/TR/2000/REC-xml-20001006 (sec 2.2, extracted 16 July 2001):
<pre>
"Legal characters are tab, carriage return, line feed, and the legal characters of Unicode
and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were
current at the time this document was prepared. New characters may be added to these
standards by amendments or new editions. Consequently, XML processors must accept any
character in the range specified for Char. The use of "compatibility characters", as
defined in section 6.8 of Unicode (see also D21 in section 3.6 of Unicode3), is discouraged.

Character Range
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]."
</pre>
- For this purpose we use a modified version of the [http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/ utf8conditioner] written by Simeon Warner at Cornell University (simeon at cs dot cornell dot edu).
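Several of the fixes in the list above can be sketched in Python. These are illustrations only and do not reproduce the modified utf8conditioner. All function names are hypothetical. The sketch assumes that well-formed NCRs should be protected along with the reserved CERs, and that a non-Plane Zero character is replaced by its bare Unicode name with no delimiters.

```python
import re
import unicodedata

# '&' that does not begin a reserved CER or a well-formed NCR.
NAKED_AMP = re.compile(r"&(?!(?:lt|gt|apos|quot|amp|#[0-9]+|#x[0-9A-Fa-f]+);)")
# C1 range, where Windows-1252 bytes end up after a Latin-1 decode.
C1_RANGE = re.compile(r"[\x80-\x9f]")
# Anything outside the XML 1.0 Char production.
ILLEGAL_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")

def escape_naked_ampersands(text: str) -> str:
    """Convert unescaped '&' to '&amp;', protecting existing references."""
    return NAKED_AMP.sub("&amp;", text)

def fix_windows_1252(text: str) -> str:
    """Reinterpret C1 control characters as Windows-1252 characters."""
    def replace(match):
        try:
            return match.group(0).encode("latin-1").decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte is undefined in Windows-1252
    return C1_RANGE.sub(replace, text)

def replace_illegal_xml_chars(text: str) -> str:
    """Replace characters that XML 1.0 does not allow with '?'."""
    return ILLEGAL_XML.sub("?", text)

def name_non_bmp(text: str) -> str:
    """Replace characters above U+FFFF with their Unicode names."""
    def replace(match):
        char = match.group(0)
        return unicodedata.name(char, "U+%04X" % ord(char))
    return NON_BMP.sub(replace, text)

if __name__ == "__main__":
    print(escape_naked_ampersands("http://example.org/?a=1&b=2"))
    print(fix_windows_1252("\x93quoted\x94"))
```

Illegal characters are replaced before non-Plane Zero characters are named in this sketch, since the second step assumes its input already satisfies the Char production.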