OAI Provider

From DLXS Documentation

Revision as of 10:31, 4 December 2007




Overview

Below you will find the documentation for using the UMProvider and LoadOai.pl. These tools, along with the OAI Harvester, are open source and available for download from Source Forge.

  • For an overview of the verbs and components of the OAI Protocol for Metadata Harvesting (OAI-PMH), see the Open Archives site.

DLXS also offers an older data provider tool, broker20, which uses XPAT and collmgr to function.

UMProvider

UMProvider is an OAI-PMH 2.0 provider Perl module. It requires that you have pre-formed oai_dc metadata in a database. The default database is MySQL, but any database supported by DBI.pm should work. The data must be stored in two tables with the following required columns:

  +----+--------------------+
  | table: oai              |
  +----+-----------+--------+
  | id | timestamp | oai_dc |
  +----+-----------+--------+

  +----+-----------+
  | table: oaisets |
  +----+-----------+
  | id | oaiset    |
  +----+-----------+

The second table, for OAI set information (oaisets), is optional, just like the use of sets in OAI-PMH 2.0.

If you would like to provide additional metadata formats such as marc21 or mods, add these columns after oai_dc in the first (main) table. The column name must match the metadata format. Here are example create table statements:

  CREATE TABLE oai (id VARCHAR(20) PRIMARY KEY,
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    oai_dc MEDIUMBLOB,
    marc21 MEDIUMBLOB);

  CREATE TABLE oaisets (id VARCHAR(20),
    oaiset VARCHAR(10));

The id values in the oai table and the oaisets table must match. The id in the oai table must be unique, but since an item can exist in multiple OAI sets, that id can repeat in the oaisets table. Records do not have to be assigned to a set, so an item in the oai table may have no rows in the oaisets table.

IMPORTANT: If you have hierarchical sets, be sure to have an entry for each unique set name down to the root set for each item. For example, if the item "abc123" is in the set "foo:bar:baz", you must have the following three rows in the oaisets table:

  +--------+-------------+
  | id     | oaiset      |
  +--------+-------------+
  | abc123 | foo:bar:baz |
  | abc123 | foo:bar     |
  | abc123 | foo         |
  +--------+-------------+
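
The rows above can be derived mechanically from the most specific set name. The following is a minimal sketch, not part of UMProvider or LoadOai.pl; the helper name expand_setspec is hypothetical, and it only prints the id/oaiset pairs rather than writing to the database.

```perl
use strict;
use warnings;

# Hypothetical helper: return a set name followed by each of its ancestors,
# most specific first, by dropping one ":"-separated segment at a time.
sub expand_setspec {
    my ($setspec) = @_;
    my @parts = split /:/, $setspec;
    my @specs;
    while (@parts) {
        push @specs, join( ':', @parts );
        pop @parts;
    }
    return @specs;
}

# Print one id/oaiset pair per row needed in the oaisets table.
for my $spec ( expand_setspec('foo:bar:baz') ) {
    print "abc123\t$spec\n";
}
```

Note that "foo:bar:baz" is 11 characters long, so the oaiset column must be wide enough to hold the longest set name you use.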

The id in the database is only the brief unique identifier and not the full OAI identifier. For the OAI-PMH response, the identifier is created with the host name (repositoryIdentifier) from the XML configuration file (oai:host:id).
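
As an illustration of the oai:host:id pattern, here is a small sketch of how the full identifier is composed from the repositoryIdentifier and the brief database id. The function name full_oai_identifier is hypothetical and is not taken from the module; the host and id values are the sample values from the configuration file.

```perl
use strict;
use warnings;

# Hypothetical helper: join the "oai" scheme, the repositoryIdentifier
# from the configuration file, and the brief id stored in the database.
sub full_oai_identifier {
    my ( $repository_identifier, $local_id ) = @_;
    return join ':', 'oai', $repository_identifier, $local_id;
}

print full_oai_identifier( 'your.host.edu', '000000001' ), "\n";
```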

Deleted records: If you would like to mark a record as deleted, just set the "oai_dc" field for that record to NULL in the database. The UMProvider will continue to return the header for this record and the header will have the deleted status attribute.
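
A minimal sketch of marking a record as deleted through DBI, assuming the default table name "oai". The helper name mark_deleted, the DSN, and the credentials are placeholders, not part of the toolkit.

```perl
use strict;
use warnings;

# Hypothetical helper: mark one record as deleted by setting its oai_dc
# column to NULL. $dbh is any connected DBI database handle.
sub mark_deleted {
    my ( $dbh, $id ) = @_;
    return $dbh->do( 'UPDATE oai SET oai_dc = NULL WHERE id = ?', undef, $id );
}

# Usage, assuming a MySQL database and placeholder credentials:
#   use DBI;
#   my $dbh = DBI->connect( 'DBI:mysql:database=foo;host=bar',
#                           'user', 'passwd', { RaiseError => 1 } );
#   mark_deleted( $dbh, 'abc123' );
#   $dbh->disconnect;
```

With the example CREATE TABLE statement above, MySQL also refreshes the timestamp column when the row changes, because of the ON UPDATE CURRENT_TIMESTAMP clause.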

Required perl modules: POSIX, XML::LibXML, DBI, Encode

Synopsis

Here is an example CGI script using UMProvider:

  use OaiProvider;
  use CGI;
  use CGI::Carp;

  my $query  = CGI->new();
  my @params = $query->param();
  my $url    = $query->url();

  my $args = {};
  foreach ( $query->param() ) { my @v = $query->param($_); $args->{$_} = "@v"; }

  my $op = new OaiProvider(
      configFile => "oai_provider_conf.xml",
      logFile    => "oai_provider.log",
      url        => $url,
      arguments  => $args);

  if ( $op !~ /OaiProvider/ )
  {
      carp ($op);
      print $query->header(-status => 500);
      exit;
  }

  if ( ! $op->ConnectToDb( $db_user, $db_passwd, $db_name, $db_server ) )
  {
      carp ("failed ConnectToDb: user=$db_user, db=$db_name, server=$db_server");
      print $query->header(-status => 500);
      exit;
  }

  $op->BuildResponse();
  print $query->header(-type => 'text/xml', -charset => 'utf-8', -status => 200);
  print $op->toString();

Public Methods available to CGI scripts using UMProvider:

  • new(): Creates a new UMProvider object
  $op->new( hash_ref )
          { configFile => "oai_provider_conf.xml",
            logFile    => "oai_provider.log",
            url        => "http://some.url.org/OAI",
            arguments  => $args,                   ## e.g. $args->{verb} = "ListSets"

            ## optional
            maxItems     => 500,                   ## default 100
            tableName    => "my_oai",              ## default "oai"
            setTableName => "my_oaisets",          ## default "oaisets"
            DBDriver     => "SQLite",              ## default "mysql"
            DbUpdate     => "2007-12-25 10:00:01", ## default to check update time of table
          }

Note: When the database table holding the OAI data is altered (Update_time changes for the table), any outstanding resumptionTokens become invalid. If your data is constantly updated, or the OAI table is frequently updated for some reason other than data changes, use the optional DbUpdate time in your CGI script.
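
One way to supply a stable DbUpdate value is to derive it from something that changes only when the data really changes. The sketch below assumes a convention that is not part of the toolkit: the load job touches a marker file, and the CGI script formats that file's modification time in the "YYYY-MM-DD hh:mm:ss" shape shown above. The helper name, the marker path, and the use of local time are all assumptions.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format an epoch time as "YYYY-MM-DD hh:mm:ss"
# (local time is an assumption; adjust if your setup expects UTC).
sub db_update_from_epoch {
    my ($epoch) = @_;
    return strftime( '%Y-%m-%d %H:%M:%S', localtime $epoch );
}

# Assumed convention: the load job touches this file after a real data change.
my $marker = 'oai_data_changed.marker';    # hypothetical path
my $mtime  = ( stat $marker )[9];          # undef if the file does not exist

if ( defined $mtime ) {
    print "DbUpdate => ", db_update_from_epoch($mtime), "\n";
}
else {
    print "no marker file; leave DbUpdate unset so the table's update time is checked\n";
}
```

The resulting string would be passed as the DbUpdate entry in the hash given to new().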

  • ConnectToDb(): Connects to the database.
  $op->ConnectToDb( $db_user, $db_passwd, $db_name, $db_server )
  • BuildResponse(): Builds a response based on the CGI parameters passed in to new(). If a problem is found, an OAI error node is created and added to the response object.
  • toString(): Serializes the response.
  • DisconnectDb(): Disconnects from the database. The connection to the DB should be closed when the CGI script terminates; calling this method explicitly may be useful if your CGI script does something else after getting the response from the UMProvider object.

Configuration

An XML configuration file must be passed to the UMProvider in new(). This file contains the information for the Identify verb, ListSets verb, and all possible metadata formats (for ListMetadataFormats).

Sample config file:

        <?xml version="1.0" encoding="UTF-8"?>
        <oai_config>
          <Identify>
            <repositoryName>Your Repository</repositoryName>
            <baseURL>http://your.host.edu/OAI</baseURL>
            <protocolVersion>2.0</protocolVersion>
            <adminEmail>you@your.edu</adminEmail>
            <earliestDatestamp/>
            <deletedRecord>transient</deletedRecord>
            <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
            <description>
              <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                  http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
                <scheme>oai</scheme>
                <repositoryIdentifier>your.host.edu</repositoryIdentifier>
                <delimiter>:</delimiter>
                <sampleIdentifier>oai:your.host.edu:000000001</sampleIdentifier>
              </oai-identifier>
            </description>
          </Identify>
          <ListSets>
            <set>
              <setSpec>foo</setSpec>
              <setName>All things of foo</setName>
            </set>
            <set>
              <setSpec>bar</setSpec>
              <setName>All things of bar</setName>
            </set>
          </ListSets>
          <PossibleMetadataFormats>
            <metadataFormat>
              <metadataPrefix>oai_dc</metadataPrefix>
              <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
              <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc</metadataNamespace>
            </metadataFormat>
            <metadataFormat>
              <metadataPrefix>marc21</metadataPrefix>
              <schema>http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd</schema>
              <metadataNamespace>http://www.loc.gov/MARC21/slim</metadataNamespace>
            </metadataFormat>
          </PossibleMetadataFormats>
        </oai_config>
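
Before pointing UMProvider at a new configuration file, it can help to confirm that the file parses and contains the values you expect. The following is a sketch of such a check using XML::LibXML (already a required module); it is not how UMProvider itself reads the file, and the function name summarize_config is hypothetical. The oai-identifier block sits in its own namespace, so the sketch registers a prefix for it.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse a configuration file and pull out a few values.
# Arguments are passed straight to load_xml, e.g. (location => $file).
sub summarize_config {
    my $doc = XML::LibXML->load_xml(@_);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( oid => 'http://www.openarchives.org/OAI/2.0/oai-identifier' );
    return {
        baseURL              => $xpc->findvalue('/oai_config/Identify/baseURL'),
        repositoryIdentifier => $xpc->findvalue('//oid:repositoryIdentifier'),
        sets    => [ map { $_->textContent }
              $xpc->findnodes('/oai_config/ListSets/set/setSpec') ],
        formats => [ map { $_->textContent }
              $xpc->findnodes('/oai_config/PossibleMetadataFormats/metadataFormat/metadataPrefix') ],
    };
}

# Run as: perl check_config.pl oai_provider_conf.xml  (script name is arbitrary)
if (@ARGV) {
    my $summary = summarize_config( location => $ARGV[0] );
    print "baseURL: $summary->{baseURL}\n";
    print "repositoryIdentifier: $summary->{repositoryIdentifier}\n";
    print "sets: @{ $summary->{sets} }\n";
    print "formats: @{ $summary->{formats} }\n";
}
```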

LoadOai.pl

This script looks for XML files in the data directory (data dir) and loads each record into the oai and oaisets database tables. Before using it, you must change the DB connection settings at the top of the script:

  my $dbUser   = "user";
  my $dbPasswd = "passwd";
  my $dbName   = "foo";
  my $dbServer = "bar";

The available arguments for the script are:

  -d data dir (place to find the XML data)
  -s oai set (optional)
  -h help (prints a usage message)
  -v verbose (generates verbose output, which is stored in the loading log)

The only required argument is the data dir (-d).

All of the data within and including the <metadata> element is loaded into the DB under the specified format. The script should be able to identify oai_dc and marc21, but any other metadata format may require code modifications. The data is not validated or checked in any way other than to make sure it is well-formed XML.

If the oai set option (-s) is used, the setSpec in the /record/header/ element is ignored for all records being loaded.

For the identifier, the script uses only the part after the last ":" (the id in oai:host:id).
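
The identifier rule can be pictured with a short sketch. This is an illustration of the behaviour described here, not the script's own code, and the function name local_id is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the text after the last ":" of an identifier.
# If there is no ":" at all, the whole string is returned unchanged.
sub local_id {
    my ($identifier) = @_;
    my $pos = rindex $identifier, ':';
    return substr $identifier, $pos + 1;
}

print local_id('oai:some.host.edu:id-1234'), "\n";
```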

  • DB Format (see above under UMProvider)
  • XML Format

The XML files must have the OAI record elements wrapped in a <record> element (just like OAI-PMH 2.0). By default, this script will look for record elements wrapped in a <records> tag. If you have the data nested some other way, just change the $recordXpath value in the script to the correct XPath for finding the <record> elements. There can be multiple record elements in a single file as well as multiple XML files in the data directory. Here’s an example XML file:

    <?xml version="1.0" encoding="UTF-8"?>
    <records>
      <record>
        <header>
          <identifier>oai:some.host.edu:id-1234</identifier>
          <setSpec>foo</setSpec>
        </header>
        <metadata> [ ... ]
        </metadata>
      </record>
      <record> [ ... ] </record>
    </records>
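
To check that your files have the expected shape before loading them, you can walk the record elements yourself. The sketch below uses XML::LibXML with an XPath of /records/record, which is assumed to correspond to the default layout described above; it is not LoadOai.pl's own code, and the function name list_records and the inline sample data are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed to match the default <records>/<record> nesting; change it if
# your data is nested some other way.
my $record_xpath = '/records/record';

# Hypothetical helper: return the identifier and setSpecs of each record.
sub list_records {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );
    my @found;
    for my $record ( $doc->findnodes($record_xpath) ) {
        my $identifier = $record->findvalue('header/identifier');
        my @sets = map { $_->textContent } $record->findnodes('header/setSpec');
        push @found, { identifier => $identifier, sets => \@sets };
    }
    return @found;
}

# Inline sample with an empty <metadata> element, for illustration only.
my $sample = <<'XML';
<records>
  <record>
    <header>
      <identifier>oai:some.host.edu:id-1234</identifier>
      <setSpec>foo</setSpec>
    </header>
    <metadata/>
  </record>
</records>
XML

for my $rec ( list_records($sample) ) {
    print "$rec->{identifier} [@{ $rec->{sets} }]\n";
}
```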