OAI Harvester
From DLXS Documentation
Overview
This document details how to run the harvester used for harvesting OAI records from data providers.
Harvester (UMHarvester)
N.B.: To use the harvester in your system, you may have to make changes to the Global Parameters located at the beginning of the UMHarvester script.
To start the harvester, run ./UMHarvester from within $DLXSROOT/bin/o/oaister/scripts/
These flags let you perform harvesting:
-i: The id of the repository. This is pulled from $DLXSROOT/bin/o/oaister/scripts/id_URL_table.txt, a text file that lists the id and the baseURL of repositories, e.g., cogprints=http://cogprints.ecs.soton.ac.uk/perl/oai2. (Use id_URL_table.sample.txt to get started.) You can run multiple harvests at the same time by separating ids with commas.
-v: The verbs. You can specify:
    lr = ListRecords
    id = Identify
    ls = ListSets
    lf = ListMetadataFormats
When running ListRecords, if the folder that will contain the repository's records already exists, the harvester moves the original folder to a backup location. Currently, this is set up as follows:
    - Records are placed in $DLXSROOT/prep/h/harvester/[repository_id]
    - Backup is in $DLXSROOT/prep/h/harvester_other/backup/[repository_id]
    - Logs of ListRecords output are placed in $DLXSROOT/bin/o/oaister/scripts/log/. The active.log is overwritten for each (set of) harvests run. The [repository_id].log is appended with successful harvests or harvesting errors.
-s: To harvest sets for a particular repository. Use the setSpec name for a particular set.
-f: To harvest records in one of the following metadata formats:
    mods
    oai_marc
    marc21
    marc21a
    marc21b
    marcxml
The harvester will harvest in oai_dc format unless a metadata format is specified.
-n: To harvest records from the last harvest date for a repository (i.e., incremental harvesting). This flag checks the datestamp granularity from the Identify response and starts harvesting either the next second after the last harvest finished (e.g., last harvest finished 2005-03-15T10:45:46Z, incremental harvest starts at 2005-03-15T10:45:47Z) or the next day after the last harvest finished (e.g., last harvest finished 2005-03-15, incremental harvest starts at 2005-03-16). The flag creates a tar.gz copy of the repository's records and puts that in the backup directory. As the incremental harvest runs, it checks the contents of the repository's directory and replaces those records that have been modified. If it finds no replacement for a harvested record, it places this in a directory specific to the incremental harvest date (e.g., cogprints1-1000_2007-03-28).
-a: To retry a harvest following a timeout. With this flag, the harvester waits 5 minutes before retrying a request that timed out on a particular resumptionToken. The limit is 10 retries for a particular harvest.
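The start-date rule described for -n can be illustrated with a short Python sketch. This is not the UMHarvester code (which is not shown here); the function name next_harvest_start is hypothetical. The two granularity strings are the values an OAI-PMH Identify response can report, and the sketch simply adds one second or one day to the datestamp of the last finished harvest.

```python
from datetime import datetime, timedelta

# Granularity values as reported in an OAI-PMH Identify response.
DAY_GRANULARITY = "YYYY-MM-DD"
SECOND_GRANULARITY = "YYYY-MM-DDThh:mm:ssZ"


def next_harvest_start(last_finished: str, granularity: str) -> str:
    """Return the datestamp at which an incremental harvest would start.

    Hypothetical helper: it mirrors the rule described for -n, not the
    actual UMHarvester implementation.
    """
    if granularity == SECOND_GRANULARITY:
        fmt = "%Y-%m-%dT%H:%M:%SZ"
        step = timedelta(seconds=1)  # start the next second
    elif granularity == DAY_GRANULARITY:
        fmt = "%Y-%m-%d"
        step = timedelta(days=1)  # start the next day
    else:
        raise ValueError(f"unknown granularity: {granularity!r}")
    finished = datetime.strptime(last_finished, fmt)
    return (finished + step).strftime(fmt)


if __name__ == "__main__":
    print(next_harvest_start("2005-03-15T10:45:46Z", SECOND_GRANULARITY))
    print(next_harvest_start("2005-03-15", DAY_GRANULARITY))
```

Passing the granularity explicitly, rather than guessing it from the shape of the datestamp, keeps the sketch close to the described behavior, where the granularity comes from the repository's Identify response.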
Examples:
    ./UMHarvester -i auburn,epsilondiss,rdn -v ls
    ./UMHarvester -i cogprints -v id
    ./UMHarvester -i uiucimages -v lr -s ALA
    ./UMHarvester -i lcoa1 -v lr -f mods
    ./UMHarvester -i forex -v lr -n
    ./UMHarvester -i CCSDthesis -v lr -a
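To show how the flags in the examples above relate to the underlying protocol, here is a rough Python sketch that reads id=baseURL lines in the style of id_URL_table.txt and builds the corresponding OAI-PMH request URL. The function names are hypothetical and this is not how UMHarvester is necessarily written. In particular, it is an assumption that the value given to -f is passed through unchanged as the OAI-PMH metadataPrefix, and the uiucimages base URL in the usage lines is a placeholder, not the real one.

```python
from urllib.parse import urlencode

# Short verb codes accepted by -v, mapped to OAI-PMH verbs.
VERBS = {
    "lr": "ListRecords",
    "id": "Identify",
    "ls": "ListSets",
    "lf": "ListMetadataFormats",
}


def parse_id_url_table(text):
    """Parse 'id=baseURL' lines into a dict, skipping blank or malformed lines."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, since a base URL may itself contain '='.
        repo_id, base_url = line.split("=", 1)
        table[repo_id.strip()] = base_url.strip()
    return table


def build_request(table, repo_id, verb, set_spec=None, metadata_format="oai_dc"):
    """Build an OAI-PMH request URL for one repository (hypothetical helper)."""
    params = {"verb": VERBS[verb]}
    if verb == "lr":
        # ListRecords requires a metadataPrefix; oai_dc is the documented default.
        params["metadataPrefix"] = metadata_format
        if set_spec:
            params["set"] = set_spec
    return table[repo_id] + "?" + urlencode(params)


if __name__ == "__main__":
    table = parse_id_url_table(
        "cogprints=http://cogprints.ecs.soton.ac.uk/perl/oai2\n"
        "uiucimages=http://example.org/oai\n"  # placeholder base URL
    )
    print(build_request(table, "cogprints", "id"))
    print(build_request(table, "uiucimages", "lr", set_spec="ALA"))
```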
The Batch_UMHarvest file is used to run automated incremental harvests on repositories. See the $DLXSROOT/bin/o/oaister/scripts/Batch_UMHarvest_sample file for an example.
    my @Monday =
        (['uiucimages', 'ALA', 'oai_dc', 'dr', 's'],
        );
For each repository you wish to batch harvest, add an entry with your own repository id, set, metadata format, run specification (r to run OAITransform, dr to not run it), and flag to skip HTML removal. Batch_UMHarvest will perform an incremental harvest from the last time you harvested, based on the .log file for that repository id.
Rename Batch_UMHarvest_sample to Batch_UMHarvest to use it. To start Batch_UMHarvest, run
    ./Batch_UMHarvest -d M &
from within $DLXSROOT/bin/o/oaister/scripts/. This will run all the repository ids within the "M" (or Monday) batch harvest group.
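To make the fields of a batch entry concrete, the following Python sketch maps one entry to the UMHarvester invocation it would roughly correspond to, using only the flags documented above. It is an illustration under assumptions, not the Batch_UMHarvest code: the function name plan_batch_entry is hypothetical, and it is assumed that an empty set means "no -s flag" and that oai_dc needs no -f flag because it is the default.

```python
def plan_batch_entry(entry):
    """Describe what a batch entry asks for (hypothetical helper).

    entry is [repository id, set, metadata format, run specification,
    optional 's' flag], mirroring the @Monday example.
    """
    repo_id, set_spec, metadata_format, run_spec, *flags = entry

    command = ["./UMHarvester", "-i", repo_id, "-v", "lr"]
    if set_spec:
        command += ["-s", set_spec]
    if metadata_format != "oai_dc":
        # oai_dc is the harvester's default, so -f is only needed otherwise.
        command += ["-f", metadata_format]
    # Batch harvests are described as incremental, hence -n.
    command.append("-n")

    return {
        "command": command,
        "run_transform": run_spec == "r",  # 'dr' means do not run OAITransform
        "skip_html_removal": "s" in flags,
    }


if __name__ == "__main__":
    plan = plan_batch_entry(["uiucimages", "ALA", "oai_dc", "dr", "s"])
    print(" ".join(plan["command"]))
    print(plan["run_transform"], plan["skip_html_removal"])
```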