OAI Harvester

From DLXS Documentation

Overview

This document details how to run the harvester used for harvesting OAI records from data providers.

This tool, along with the OAI Provider, is open source. It is available for download from SourceForge (UMich OAI Toolkit) and is also included in DLXS release 14.

Also available in the UMich OAI Toolkit is a simple script, OaiList.pl, that can be used for harvesting OAI data. Details on how to use it are documented separately.

Harvester (UMHarvester)

N.B.: To use the harvester in your system, you may have to make changes to the Global Parameters located at the beginning of the UMHarvester script.

To start the harvester, run ./UMHarvester from within $DLXSROOT/bin/o/oaister/scripts/.

These flags let you perform harvesting:

  • -i: The id of the repository. This is pulled from $DLXSROOT/bin/o/oaister/scripts/id_URL_table.txt, a text file that lists the id and the baseURL of repositories, e.g., cogprints=http://cogprints.ecs.soton.ac.uk/perl/oai2. (Use id_URL_table.sample.txt to get started.) You can run multiple harvests at the same time by separating ids with commas.
  • -v: The verbs. You can specify:
    lr = ListRecords
    id = Identify
    ls = ListSets
    lf = ListMetadataFormats
    When running ListRecords, if the folder that will contain the repository's records already exists, the harvester moves the existing folder to a backup location. Currently, the locations are set up as follows:
    • Records are placed in $DLXSROOT/prep/h/harvester/[repository_id]
    • Backup is in $DLXSROOT/prep/h/harvester_other/backup/[repository_id]
    • Logs of ListRecords output are placed in $DLXSROOT/bin/o/oaister/scripts/log/. The active.log file is overwritten each time a harvest (or set of harvests) is run. Successful harvests and harvesting errors are appended to [repository_id].log.
  • -s: To harvest a particular set from a repository. Specify the set by its setSpec name.
  • -f: To harvest records in the following metadata formats:
    mods
    oai_marc
    marc21
    marc21a
    marc21b
    marcxml
    If no metadata format is specified, the harvester harvests in oai_dc format.
  • -n: To harvest only records changed since the last harvest date for a repository (i.e., incremental harvesting). This flag checks the datestamp granularity from the Identify response and starts harvesting either at the next second after the last harvest finished (e.g., last harvest finished 2005-03-15T10:45:46Z, incremental harvest starts at 2005-03-15T10:45:47Z) or on the next day after the last harvest finished (e.g., last harvest finished 2005-03-15, incremental harvest starts at 2005-03-16). Before harvesting, the flag creates a tar.gz copy of the repository's records and puts it in the backup directory. As the incremental harvest runs, it checks the contents of the repository's directory and replaces those records that have been modified. If a harvested record does not replace an existing record, it is placed in a directory specific to the incremental harvest date (e.g., cogprints1-1000_2007-03-28).
  • -a: To retry a harvest following a timeout. After a timeout on a particular resumptionToken, the harvester waits 5 minutes before retrying. The limit is 10 retries for a particular harvest.
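
The date arithmetic behind -n and the wait-and-retry behaviour behind -a can be illustrated with a minimal Python sketch. This is not UMHarvester's own code (UMHarvester is a separate script and its internals are not shown here); the function names next_incremental_from and fetch_with_retry are hypothetical. The two granularity strings are the values an OAI-PMH Identify response may report. As a simplifying assumption, the sketch counts retries for a single request rather than across a whole harvest.

```python
import time
from datetime import datetime, timedelta

RETRY_WAIT_SECONDS = 5 * 60   # -a waits 5 minutes before retrying
MAX_RETRIES = 10              # -a gives up after 10 retries


def next_incremental_from(last_harvest, granularity):
    """Return the start datestamp for the next incremental harvest.

    `granularity` is the value reported by Identify:
    "YYYY-MM-DDThh:mm:ssZ" (seconds) or "YYYY-MM-DD" (days).
    """
    if granularity == "YYYY-MM-DDThh:mm:ssZ":
        fmt, step = "%Y-%m-%dT%H:%M:%SZ", timedelta(seconds=1)
    elif granularity == "YYYY-MM-DD":
        fmt, step = "%Y-%m-%d", timedelta(days=1)
    else:
        raise ValueError("unknown granularity: %r" % granularity)
    finished = datetime.strptime(last_harvest, fmt)
    return (finished + step).strftime(fmt)


def fetch_with_retry(fetch, sleep=time.sleep):
    """Call fetch(); on TimeoutError, wait and retry up to MAX_RETRIES times.

    `fetch` is a hypothetical zero-argument callable that performs one
    request (for example, one resumptionToken page).
    """
    retries = 0
    while True:
        try:
            return fetch()
        except TimeoutError:
            if retries >= MAX_RETRIES:
                raise  # retry limit reached; report the timeout
            retries += 1
            sleep(RETRY_WAIT_SECONDS)
```

Passing `sleep` as a parameter keeps the sketch testable without actually waiting five minutes between attempts.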

Examples:

	./UMHarvester -i auburn,epsilondiss,rdn -v ls
	./UMHarvester -i cogprints -v id
	./UMHarvester -i uiucimages -v lr -s ALA
	./UMHarvester -i lcoa1 -v lr -f mods
	./UMHarvester -i forex -v lr -n
	./UMHarvester -i CCSDthesis -v lr -a
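
For orientation, the commands above correspond to ordinary OAI-PMH requests against a repository's baseURL. The following Python sketch shows roughly what such a request URL looks like. It is an illustration only, not UMHarvester's implementation: parse_table_line and build_request are hypothetical names, and passing the -f value straight through as the metadataPrefix is an assumption (the harvester may map names such as marc21a differently).

```python
from urllib.parse import urlencode

# The -v abbreviations and the OAI-PMH verbs they stand for.
VERBS = {
    "lr": "ListRecords",
    "id": "Identify",
    "ls": "ListSets",
    "lf": "ListMetadataFormats",
}


def parse_table_line(line):
    """Split one 'id=baseURL' line of id_URL_table.txt into its two parts."""
    repo_id, base_url = line.strip().split("=", 1)
    return repo_id, base_url


def build_request(base_url, verb, set_spec=None, metadata_format="oai_dc",
                  from_date=None):
    """Build an OAI-PMH request URL for one of the -v abbreviations."""
    params = {"verb": VERBS[verb]}
    if verb == "lr":
        # ListRecords requires a metadataPrefix; set and from are optional.
        params["metadataPrefix"] = metadata_format  # assumption: -f value used as-is
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


repo_id, base_url = parse_table_line(
    "cogprints=http://cogprints.ecs.soton.ac.uk/perl/oai2")
print(build_request(base_url, "id"))
```

The same helper called with `"lr"` and `set_spec="ALA"` yields a ListRecords request restricted to that set, mirroring the -s example.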

The Batch_UMHarvest file is used to run automated incremental harvests on repositories. See the $DLXSROOT/bin/o/oaister/scripts/Batch_UMHarvest_sample file for an example.

	# Fields: repository id, set, metadata format, run specification, skip-HTML-removal flag
	my @Monday =
	(['uiucimages', 'ALA', 'oai_dc', 'dr', 's'],
	);

For each repository you wish to batch harvest, add an entry giving your own repository id, set, metadata format, run specification (r to run OAITransform, dr to not run it), and flag to skip HTML removal. Batch_UMHarvest will perform an incremental harvest from the last time you harvested, based on the .log file for that repository id.
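
To relate a batch entry to the flags described earlier, the Python sketch below builds the manual UMHarvester command that would request the same incremental harvest. It is an assumption-laden illustration: manual_command is a hypothetical helper, and how Batch_UMHarvest actually invokes the harvester is not documented here. The run specification and HTML-removal fields are left out of the command because no UMHarvester flag for them is listed above.

```python
def manual_command(entry):
    """Return the argv list for a manual incremental harvest of one batch entry.

    `entry` follows the field order of the sample:
    [repository id, set, metadata format, run specification, HTML-removal flag].
    """
    repo_id, set_spec, metadata_format, run_spec, html_flag = entry
    argv = ["./UMHarvester", "-i", repo_id, "-v", "lr"]
    if set_spec:
        argv += ["-s", set_spec]
    if metadata_format != "oai_dc":
        # oai_dc is the harvester's default, so -f is only needed otherwise.
        argv += ["-f", metadata_format]
    argv.append("-n")  # incremental harvest from the last harvest date
    # run_spec and html_flag have no documented UMHarvester flag; not represented.
    return argv


print(" ".join(manual_command(["uiucimages", "ALA", "oai_dc", "dr", "s"])))
```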

To use it, rename Batch_UMHarvest_sample to Batch_UMHarvest. To start Batch_UMHarvest, run

       ./Batch_UMHarvest -d M &

from within $DLXSROOT/bin/o/oaister/scripts/. This will run all the repository ids within the "M" (or Monday) batch harvest group.
