OaiList

From DLXS Documentation

Overview

The UMich OAI Toolkit and DLXS releases 14 and later include a simple script, OaiList.pl, for making OAI-PMH ListIdentifiers and ListRecords requests. It covers basic OAI harvesting needs through the ListRecords request.

Basic Setup

Before the script can be used, a few things need to be configured:

1) The path to perl on the first line of the script must be changed to match the location of your perl binary:

   # which perl
   /usr/bin/perl
   Change the first line to: #!/usr/bin/perl

2) Copy the sample configuration file to RepositoryConfig.cfg and edit the copy:

    # cp RepositoryConfig.sample.cfg RepositoryConfig.cfg 
    # edit RepositoryConfig.cfg to include the repositories that you would like to harvest.
    

    <repository id="[unique ID]">
      <baseUrl>[OAI-PMH base URL for provider]</baseUrl>
      <fullName>[human readable name for the provider]</fullName>
    </repository>

An example:

    <repository id="umich">
      <baseUrl>http://quod.lib.umich.edu/cgi/o/oai/oai</baseUrl>
      <fullName>University of Michigan Library Repository</fullName>
    </repository>
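
An entry can be sanity-checked before harvesting. The following is a minimal Python sketch, not part of the toolkit: it parses only the single `<repository>` element shown above, and the function name `read_repository` is hypothetical. How OaiList.pl itself reads the file, and what wraps the entries in the full file, is not covered here.

```python
import xml.etree.ElementTree as ET

# A single entry, copied from the example above.
ENTRY = """
<repository id="umich">
  <baseUrl>http://quod.lib.umich.edu/cgi/o/oai/oai</baseUrl>
  <fullName>University of Michigan Library Repository</fullName>
</repository>
"""


def read_repository(xml_text):
    """Return (id, baseUrl, fullName) for one <repository> element.

    Hypothetical helper for illustration; not the toolkit's own parser.
    """
    node = ET.fromstring(xml_text.strip())
    if node.tag != "repository":
        raise ValueError("expected a <repository> element")
    repo_id = node.get("id")
    base_url = node.findtext("baseUrl")
    full_name = node.findtext("fullName")
    if not repo_id or not base_url:
        raise ValueError("repository needs an id attribute and a <baseUrl>")
    return repo_id, base_url.strip(), (full_name or "").strip()


if __name__ == "__main__":
    print(read_repository(ENTRY))
```

The id returned here is the value that would be passed to the script with -i.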

NOTE: If you are not running the OaiList.pl script from its unpacked location, umich_oai_toolkit/harvester/, you will need to change the path to the configuration file in the script here:

   my $repositoryConfigFile = "../RepositoryConfig.cfg";

3) If you want to harvest data to the default location, you will need to make a directory (oai/) where the script resides (umich_oai_toolkit/harvester/):

   # cd umich_oai_toolkit/harvester/
   # mkdir oai/

If you want to harvest your data to another location, change the $dataDir variable in the OaiList.pl script, or pass -d data_dir on the command line.

Using the script

Running the script with no arguments lists its command-line parameters:

   # ./OaiList.pl 
   USAGE: ./OaiList.pl -i repo_id 
       [-t type_of_list (default: records) || (identifiers)]
       [-s set_id (other than what is in config)] 
       [-m metadata_format (default oai_dc)]
       [-f from (YYYY-MM-DDThh:mm:ssZ)]
       [-u until (YYYY-MM-DDThh:mm:ssZ)]
       [-x max_request (default 100000)]
       [-w (write individual record files, not a ListRecords file)]
       [-d data_dir]
       [-v (verbose)]
       [-r (to resume using token file)]

In detail:

-i repo_id: This is the only required argument. It must be the ID of the repository as configured in the RepositoryConfig.cfg file described earlier. In the above example this would be umich.

-t type_of_list (default: records) || (identifiers): To perform a ListIdentifiers request, use -t identifiers. Otherwise ListRecords will be called.

-s set_id (other than what is in config): Harvests a specific OAI set. See the repository's ListSets response for the available sets.

-m metadata_format (default oai_dc): Harvests a specific OAI metadata format. See the repository's ListMetadataFormats response for the available formats.

-f from (YYYY-MM-DDThh:mm:ssZ): Specifies the from date for selective harvesting as defined by the OAI-PMH 2.0 spec.

-u until (YYYY-MM-DDThh:mm:ssZ): Specifies the until date for selective harvesting as defined by the OAI-PMH 2.0 spec.

-x max_request (default 100000): Sets the maximum number of requests to make.

-w (write individual record files, not a ListRecords file): Use -w to have the script write each record individually. Otherwise each ListRecords response is written out as it is returned.

-d data_dir: Overrides the default data directory defined in the OaiList.pl script ($dataDir). This is where the OAI response data is saved to disk.

-v (verbose): Verbose output.

-r (to resume using token file): If a previous harvest was interrupted, this reads the resumption token from the token.txt file (saved in the data directory) and resumes harvesting where it left off.
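
For orientation, the options above correspond to the arguments of an OAI-PMH 2.0 request roughly as follows. This is a Python sketch under stated assumptions, not the code in OaiList.pl: the function names `build_request_url` and `extract_resumption_token` are hypothetical, and the mapping follows the OAI-PMH 2.0 spec (verb, metadataPrefix, set, from, until, resumptionToken) rather than the script's internals.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_request_url(base_url, list_type="records", metadata_format="oai_dc",
                      set_id=None, from_date=None, until_date=None,
                      resumption_token=None):
    """Build a ListRecords or ListIdentifiers URL (hypothetical helper)."""
    # -t identifiers selects ListIdentifiers; anything else means ListRecords.
    verb = "ListIdentifiers" if list_type == "identifiers" else "ListRecords"
    if resumption_token is not None:
        # In OAI-PMH 2.0, resumptionToken is an exclusive argument:
        # only the verb may accompany it.
        params = [("verb", verb), ("resumptionToken", resumption_token)]
    else:
        params = [("verb", verb), ("metadataPrefix", metadata_format)]  # -m
        if set_id:
            params.append(("set", set_id))          # -s
        if from_date:
            params.append(("from", from_date))      # -f
        if until_date:
            params.append(("until", until_date))    # -u
    return base_url + "?" + urlencode(params)


def extract_resumption_token(response_xml):
    """Return the resumptionToken text from a response, or None if absent/empty."""
    root = ET.fromstring(response_xml)
    node = root.find(".//" + OAI_NS + "resumptionToken")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


if __name__ == "__main__":
    print(build_request_url("http://quod.lib.umich.edu/cgi/o/oai/oai",
                            set_id="mbooks:pd"))
```

A harvester following this pattern would repeat the request with each returned token until no token comes back or the request limit (-x) is reached. An empty resumptionToken element marks the end of the list.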


Example commands


    # ./OaiList.pl -i umich -s mbooks
      [harvest the mbooks set from umich]

    # ./OaiList.pl -i umich -s mbooks:pd -x 100
      [harvest the mbooks:pd set from umich, max 100 requests]

    # ./OaiList.pl -i umich -f 2008-10-01T01:01:01Z
      [harvest new data since (from) Oct. 1, 2008 from umich]
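
For recurring harvests, the -f value can be computed rather than typed by hand. The following Python sketch is an illustration only: the helper name `oai_timestamp` and the seven-day window are assumptions, and the output is simply the YYYY-MM-DDThh:mm:ssZ form shown in the usage message, expressed in UTC.

```python
from datetime import datetime, timedelta, timezone


def oai_timestamp(moment):
    """Format a timezone-aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    # Hypothetical window: everything changed in the last seven days.
    since = datetime.now(timezone.utc) - timedelta(days=7)
    print("./OaiList.pl -i umich -f " + oai_timestamp(since))
```

Repositories that advertise only day granularity in their Identify response expect YYYY-MM-DD instead, so the format may need to be shortened for those providers.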
