OAI Provider

From DLXS Documentation

Current revision (14:31, 28 October 2010)


Overview

This page documents the UMProvider and LoadOai.pl, which together make metadata available to the world using the OAI Protocol for Metadata Harvesting (OAI-PMH, http://www.openarchives.org/). These tools, along with the OAI Harvester, are open source. They are available for download from SourceForge (http://sourceforge.net/projects/umoaitoolkit) and are included in DLXS release 14.

DLXS also offers an older data provider tool, broker20, which uses XPAT and collmgr to function.

Setup within DLXS

If you are using the UMProvider within a DLXS installation, you will be prompted to configure it during the installation step (running Installer.pl). If you are not ready to set up your OAI provider when you install DLXS, you can come back and install the UMProvider at any time. To install and configure only the UMProvider, run Installer.pl with the -f option:


  # cd $DLXSROOT/bin/i/installer/
  # perl ./Installer.pl -f $DLXSROOT/cgi/o/oai/oai.cfg

Release_15: If you are not using DLXS, the symlink to the Perl interpreter may not be set up, so you may need to change the path to the Perl interpreter in $DLXSROOT/cgi/o/oai/oai.

Release_15: Make sure that you have installed the following modules from CPAN:


   Switch
   Date::Parse
   Config::Tiny
   XML::LibXML
   Mail::Mailer

If you have the UMProvider configured correctly, and you have updated your database with the MySQL dump file that is provided with the release, you will not need to create the database tables or create/modify the CGI script. You should now be able to make the following requests and get proper OAI-PMH XML responses:


  http://your.host.org/cgi/o/oai/oai?verb=Identify
  http://your.host.org/cgi/o/oai/oai?verb=ListSets
  http://your.host.org/cgi/o/oai/oai?verb=ListRecords&metadataPrefix=oai_dc

If these look good, the DLXS configuration steps are complete. You can skip down to the Configuration section below, where you describe your OAI provider. After that you are ready to load data.
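
The test requests above all follow the same pattern: the URL of the CGI script, a verb, and optional arguments. As a minimal sketch (Python is used only for illustration; the helper name oai_request_url is hypothetical and not part of the UMProvider distribution), such URLs could be generated like this:

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    query.update(arguments)
    # urlencode escapes the values and joins the pairs with "&"
    return base_url + "?" + urlencode(query)


# Placeholder base URL taken from the example requests above
BASE_URL = "http://your.host.org/cgi/o/oai/oai"

if __name__ == "__main__":
    print(oai_request_url(BASE_URL, "Identify"))
    print(oai_request_url(BASE_URL, "ListSets"))
    print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

Building the query string with urlencode rather than by hand keeps values such as dates or resumptionTokens correctly escaped.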

Setup outside of DLXS

Basic steps for setting up an OAI-PMH 2.0 provider using UMProvider.pm downloaded from SourceForge (http://sourceforge.net/projects/umoaitoolkit):

  1. Create the oai and oaisets tables in MySQL (or another database). See CREATE TABLE examples and explanation below.
  2. A script described below (LoadOai.pl) helps with loading the metadata, but you still need to generate Dublin Core (DC) metadata for each record. One option is to wrap the oai_dc records in <metadata> tags and put them into the format that LoadOai.pl loads (details below). If you prefer another loading mechanism, create these records and load everything, including the <metadata> element, into the oai_dc field of the oai table. Either way, you also need a unique identifier for each record as well as its set information.
    We also have scripts and XSLT style sheets that should help convert an existing DLXS collection to DC for loading into the database. Information for that can be found here: DLXS_to_DC.
  3. Copy sample_config.xml to config.xml and edit for your setup. Be sure to change: repositoryName, baseURL, adminEmail, repositoryIdentifier, sampleIdentifier, the list of sets and list of possible metadata formats.
  4. Place oai (the CGI script), config.xml and UMProvider.pm in a CGI directory on a web server.
  5. Edit oai (the CGI script) to add
    1. database connection settings,
    2. pointer to your config.xml, and
    3. log file location (details below).
  6. Test OAI provider:
   http://your.host.org/cgi-bin/oai?verb=Identify
   http://your.host.org/cgi-bin/oai?verb=ListRecords&metadataPrefix=oai_dc
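
Step 2 asks for oai_dc records wrapped in <metadata> tags. The following is a minimal sketch of producing such a fragment, assuming you already have the field values in hand. The function name build_metadata and the sample values are hypothetical. The namespace URIs are the standard oai_dc and Dublin Core ones, and the xsi:schemaLocation attribute that a complete oai_dc record normally carries is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for the oai_dc container and the DC elements
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def build_metadata(fields):
    """Wrap Dublin Core fields in <metadata><oai_dc:dc>...</oai_dc:dc></metadata>.

    `fields` is a list of (element_name, value) pairs such as ("title", "...").
    """
    metadata = ET.Element("metadata")
    dc_root = ET.SubElement(metadata, "{%s}dc" % OAI_DC_NS)
    for name, value in fields:
        child = ET.SubElement(dc_root, "{%s}%s" % (DC_NS, name))
        child.text = value  # special characters are escaped on serialization
    return ET.tostring(metadata, encoding="unicode")


if __name__ == "__main__":
    print(build_metadata([("title", "Example title"), ("creator", "A & B")]))
```

The resulting string is the kind of value that would go into the oai_dc column, or inside a <record> in a file for LoadOai.pl.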

UMProvider

UMProvider is an OAI-PMH 2.0 provider Perl module. It relies heavily on libxml (http://xmlsoft.org/), through XML::LibXML, for creating and parsing XML data, and on a database for storage.

UMProvider requires that you have pre-formed oai_dc metadata in a database. The default database is MySQL, but any database supported by DBI.pm should work. The data must be stored in two tables with the following required columns:

  +----+--------------------+
  | table: oai              |
  +----+-----------+--------+
  | id | timestamp | oai_dc |
  +----+-----------+--------+

  +----+-----------+
  | table: oaisets |
  +----+-----------+
  | id | oaiset    |
  +----+-----------+

The second table for OAI set information (oaisets) is optional just like the use of sets in OAI-PMH 2.0.

If you would like to provide additional metadata formats such as marc21 or mods, add a column for each format after oai_dc in the first (main) table. The column name must match the metadata format. Release_15: You must also create the tables oai_marc and oai_mods. Here are example CREATE TABLE statements:

  -- Main table: one row per record, one column per metadata format
  CREATE TABLE oai (id VARCHAR(50),
    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    oai_dc MEDIUMBLOB,
    mods   MEDIUMBLOB,
    marc21 MEDIUMBLOB,
    PRIMARY KEY (id),
    KEY timestamp (timestamp),
    KEY oai_time_id (timestamp, id));

  -- Set membership: one row per (record, set) pair
  CREATE TABLE oaisets (id VARCHAR(50),
    oaiset VARCHAR(32),
    PRIMARY KEY (id, oaiset),
    KEY oaiset (oaiset));

  -- Release_15 tables for the additional formats
  CREATE TABLE oai_marc (
    id VARCHAR(150) NOT NULL DEFAULT '',
    timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    marc21 MEDIUMBLOB,
    PRIMARY KEY (id),
    KEY timestamp (timestamp),
    KEY oai_time_id (timestamp, id));

  CREATE TABLE oai_mods (
    id VARCHAR(150) NOT NULL DEFAULT '',
    timestamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    mods MEDIUMBLOB,
    PRIMARY KEY (id),
    KEY timestamp (timestamp),
    KEY oai_time_id (timestamp, id));

The id values in the oai and oaisets tables must match. The id in the oai table must be unique, but since an item can belong to multiple OAI sets, the same id can repeat in the oaisets table. Records do not have to be assigned to a set, so an item in the oai table may have no rows in the oaisets table.

IMPORTANT: If you have hierarchical sets, be sure to have an entry for each unique set name down to the root set for each item. For example, if the item "abc123" is in the set "foo:bar:baz", you must have the following three rows in the oaisets table:

  +--------+-------------+
  | id     | oaiset      |
  +--------+-------------+
  | abc123 | foo:bar:baz |
  | abc123 | foo:bar     |
  | abc123 | foo         |
  +--------+-------------+
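
The rows above can be derived mechanically from the most specific setSpec. A minimal sketch, with hypothetical function names, of expanding a hierarchical setSpec into the rows the oaisets table needs:

```python
def expand_set_spec(set_spec):
    """Return the setSpec and all of its ancestors, most specific first."""
    parts = set_spec.split(":")
    return [":".join(parts[:n]) for n in range(len(parts), 0, -1)]


def oaisets_rows(record_id, set_specs):
    """Return unique (id, oaiset) rows for one record.

    Duplicates are dropped because (id, oaiset) is the primary key.
    """
    rows = []
    for spec in set_specs:
        for ancestor in expand_set_spec(spec):
            row = (record_id, ancestor)
            if row not in rows:
                rows.append(row)
    return rows


if __name__ == "__main__":
    for row in oaisets_rows("abc123", ["foo:bar:baz"]):
        print(row)
```

A record in two sets that share an ancestor, for example foo:bar:baz and foo:qux, gets the shared ancestor foo only once.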

The id in the database is only the brief unique identifier and not the full OAI identifier. For the OAI-PMH response, the identifier is created with the host name (repositoryIdentifier) from the XML configuration file (oai:host:id).

Deleted records: To mark a record as deleted, set the oai_dc field for that record to NULL in the database. The UMProvider will continue to return the header for this record, and the header will carry the deleted status attribute. The UMProvider only supports marking an entire record as deleted; an individual format for a record cannot be deleted.
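
As an illustration of the deleted-record convention, here is a small sketch against an in-memory SQLite database. The simplified schema and the function names mark_deleted and header_status are assumptions made for the example. This is not the UMProvider's own code. With the MySQL schema above, the timestamp column would be updated automatically, whereas this sketch sets it explicitly.

```python
import sqlite3

# Simplified stand-in for the oai table (assumption: SQLite, three columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oai (id TEXT PRIMARY KEY, timestamp TEXT, oai_dc BLOB)")
conn.execute(
    "INSERT INTO oai VALUES (?, ?, ?)",
    ("abc123", "2010-10-28 14:31:00", b"<metadata>...</metadata>"),
)


def mark_deleted(conn, record_id, when):
    """Mark a record as deleted by setting its oai_dc field to NULL."""
    conn.execute(
        "UPDATE oai SET oai_dc = NULL, timestamp = ? WHERE id = ?",
        (when, record_id),
    )


def header_status(conn, record_id):
    """Return 'deleted', 'active', or None if the id is unknown."""
    row = conn.execute(
        "SELECT oai_dc FROM oai WHERE id = ?", (record_id,)
    ).fetchone()
    if row is None:
        return None
    return "deleted" if row[0] is None else "active"


if __name__ == "__main__":
    print(header_status(conn, "abc123"))
    mark_deleted(conn, "abc123", "2010-10-29 09:00:00")
    print(header_status(conn, "abc123"))
```

Keeping the row and updating its timestamp is what lets incremental harvesters learn about the deletion.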

Required Perl modules: POSIX, XML::LibXML, DBI, Encode

Synopsis

Here is an example CGI script using UMProvider:

  use UMProvider;
  use CGI;
  use CGI::Carp;

  ## Placeholder database settings; replace with your own
  my ( $db_user, $db_passwd, $db_name, $db_server ) = ( "user", "passwd", "foo", "bar" );

  my $query  = CGI->new();
  my @params = $query->param();
  my $url    = $query->url();

  ## Collect the CGI parameters; repeated parameters are joined with $;
  my $args = {};
  foreach ( $query->param() ) { $args->{$_} = join($;, $query->param($_)); }

  my $op = new UMProvider(
      configFile => "oai_provider_conf.xml",
      logFile    => "oai_provider.log",
      url        => $url,
      arguments  => $args);

  ## new() returns an error message instead of an object on failure
  if ( $op !~ /UMProvider/ )
  {
      carp ($op);
      print $query->header(-status => 500);
      exit;
  }

  if ( ! $op->ConnectToDb( $db_user, $db_passwd, $db_name, $db_server ) )
  {
      ## Do not write the password to the log
      carp ("failed ConnectToDb: $db_user, $db_name, $db_server");
      print $query->header(-status => 500);
      exit;
  }

  $op->BuildResponse();
  print $query->header(-type => 'text/xml', -charset => 'utf-8', -status => 200);
  print $op->toString();

Public Methods available to CGI scripts using UMProvider:

  • new(): Creates a new UMProvider object
  $op->new( hash_ref )
          { configFile => "oai_provider_conf.xml",
            logFile    => "oai_provider.log",
            url        => "http://some.url.org/OAI",
            arguments  => $args,                   ## e.g. $args->{verb} = "ListSets"

            ## optional
            maxItems     => 500,                   ## default 100
            tableName    => "my_oai",              ## default "oai"
            setTableName => "my_oaisets",          ## default "oaisets"
            DBDriver     => "SQLite",              ## default "mysql"
            shortDate    => 1,                     ## use the short date format YYYY-MM-DD (no time)
            DbUpdate     => "2007-12-25 10:00:01", ## default to check update time of table
          }

Note: When the database table holding the OAI data is altered (Update_time changes for the table), any outstanding resumptionTokens become invalid. If your data is constantly updated, or the OAI table is frequently updated for some reason other than data changes, use the optional DbUpdate time in your CGI script.

  • ConnectToDb(): Connect to the database. The example script above treats a false return value as a failed connection.
  $op->ConnectToDb( $db_user, $db_passwd, $db_name, $db_server )
  • $op->BuildResponse(): Builds a response based on the CGI parameters passed in from new(). If a problem is found, an OAI error node is created and added to the response object.
  • $op->toString(): Serialize the response.
  • $op->DisconnectDb(): Disconnect from the database. The connection to the DB should already be closed when the CGI script terminates, so an explicit call is mainly useful if your CGI script does something else after getting the response from the UMProvider object.
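
From the client side, the error node added by BuildResponse() and any resumptionToken appear as elements in the OAI-PMH namespace. The sketch below shows one way a test script might inspect a response. The function name inspect_response is hypothetical, and SAMPLE_ERROR is a hand-written response in the shape the OAI-PMH specification describes, not captured UMProvider output.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def inspect_response(xml_text):
    """Return (errors, resumption_token) found in an OAI-PMH response.

    errors is a list of (code, message) pairs; resumption_token is None
    when the response carries no non-empty resumptionToken element.
    """
    root = ET.fromstring(xml_text)
    errors = [
        (node.get("code"), (node.text or "").strip())
        for node in root.iter(OAI_NS + "error")
    ]
    token = root.find(".//" + OAI_NS + "resumptionToken")
    token_value = token.text.strip() if token is not None and token.text else None
    return errors, token_value


# Hand-written sample, for illustration only
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-10-28T14:31:00Z</responseDate>
  <request>http://your.host.org/cgi/o/oai/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""

if __name__ == "__main__":
    print(inspect_response(SAMPLE_ERROR))
```

Per the OAI-PMH specification, a resumptionToken that is no longer valid is reported with the error code badResumptionToken, so a harvester seeing that code would need to restart its list request.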

Configuration

The UMProvider configuration contains information about the repository for the Identify, ListSets and ListMetadataFormats OAI-PMH verb responses. This data rarely changes, so it is stored in an XML configuration file.

  • # cd $DLXSROOT/cgi/o/oai/
  • # cp sample_config.xml oai_conf.xml
  • edit oai_conf.xml (the copy you just made)

Information on the values to use within the Identify section can be found in the OAI-PMH specification (http://www.openarchives.org/OAI/openarchivesprotocol.html#Identify). Sets are optional in OAI-PMH, so the <ListSets> section can be left empty; more information about sets is at http://www.openarchives.org/OAI/openarchivesprotocol.html#ListSets. The <PossibleMetadataFormats> section should list oai_dc and any other metadata formats that you choose to provide; information about metadata formats is at http://www.openarchives.org/OAI/openarchivesprotocol.html#ListMetadataFormats.

Test the configuration with the following OAI requests:

  http://your.host.org/cgi/o/oai/oai?verb=Identify
  http://your.host.org/cgi/o/oai/oai?verb=ListSets
  http://your.host.org/cgi/o/oai/oai?verb=ListMetadataFormats 

Sample config file:

        <?xml version="1.0" encoding="UTF-8"?>
        <oai_config>
          <Identify>
            <repositoryName>Your Repository</repositoryName>
            <baseURL>http://your.host.edu/OAI</baseURL>
            <protocolVersion>2.0</protocolVersion>
            <adminEmail>you@your.edu</adminEmail>
            <earliestDatestamp/>
            <deletedRecord>transient</deletedRecord>
            <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
            <description>
              <oai-identifier xmlns="http://www.openarchives.org/OAI/2.0/oai-identifier"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai-identifier
                  http://www.openarchives.org/OAI/2.0/oai-identifier.xsd">
                <scheme>oai</scheme>
                <repositoryIdentifier>your.host.edu</repositoryIdentifier>
                <delimiter>:</delimiter>
                <sampleIdentifier>oai:your.host.edu:000000001</sampleIdentifier>
              </oai-identifier>
            </description>
          </Identify>
          <ListSets>
            <set>
              <setSpec>foo</setSpec>
              <setName>All things of foo</setName>
            </set>
            <set>
              <setSpec>bar</setSpec>
              <setName>All things of bar</setName>
            </set>
          </ListSets>
          <PossibleMetadataFormats>
            <metadataFormat>
              <metadataPrefix>oai_dc</metadataPrefix>
              <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
              <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc</metadataNamespace>
            </metadataFormat>
            <metadataFormat>
              <metadataPrefix>marc21</metadataPrefix>
              <schema>http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd</schema>
              <metadataNamespace>http://www.loc.gov/MARC21/slim</metadataNamespace>
            </metadataFormat>
          </PossibleMetadataFormats>
        </oai_config>
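
Before pointing the provider at an edited configuration file, it can help to confirm that the file is well-formed and lists the sets and formats you expect. A minimal sketch, assuming the element names of the sample config above; the function name summarize_config and the trimmed SAMPLE_CONFIG string are made up for the example:

```python
import xml.etree.ElementTree as ET


def summarize_config(xml_text):
    """Return the repository name, base URL, setSpecs and metadataPrefixes."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return {
        "repositoryName": root.findtext("Identify/repositoryName"),
        "baseURL": root.findtext("Identify/baseURL"),
        "sets": [s.findtext("setSpec") for s in root.findall("ListSets/set")],
        "formats": [
            f.findtext("metadataPrefix")
            for f in root.findall("PossibleMetadataFormats/metadataFormat")
        ],
    }


# Trimmed version of the sample config above, for illustration only
SAMPLE_CONFIG = """<oai_config>
  <Identify>
    <repositoryName>Your Repository</repositoryName>
    <baseURL>http://your.host.edu/OAI</baseURL>
  </Identify>
  <ListSets>
    <set><setSpec>foo</setSpec><setName>All things of foo</setName></set>
    <set><setSpec>bar</setSpec><setName>All things of bar</setName></set>
  </ListSets>
  <PossibleMetadataFormats>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>marc21</metadataPrefix></metadataFormat>
  </PossibleMetadataFormats>
</oai_config>"""

if __name__ == "__main__":
    print(summarize_config(SAMPLE_CONFIG))
```

Each metadataPrefix reported here should correspond to a column in the oai table, since the column name must match the metadata format.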

LoadOai.pl

This is a generic example script for loading data. The script looks for XML files in the data directory (data dir) and loads each record into the oai and oaisets tables. Before using this script you must change the DB connection settings at the top:

  my $dbUser   = "user";
  my $dbPasswd = "passwd";
  my $dbName   = "foo";
  my $dbServer = "bar";

The available arguments for the script are:

  -d data dir (place to find the XML data)
  -s oai set (optional)
  -h (help: message printed)
  -v (verbose output)

The only required argument is the data dir (-d).

All of the data within and including the <metadata> element is loaded into the DB under the specified format. The script should be able to identify oai_dc and marc21, but any other metadata format may require code modifications. The data is not validated or checked in any way other than to make sure it is well-formed XML.

If the oai set option (-s) is used, the setSpec in the /record/header/ element is ignored for all records being loaded.

For the identifier, the script uses only the part after the last ":" (oai:host:id). The "oai:host:" part is added to the identifier by the UMProvider.

  • DB Format (see above under UMProvider)
  • XML Format

Each record's elements must be wrapped in a <record> element (just like OAI-PMH 2.0). By default, this script looks for <record> elements wrapped in a <records> tag. If your data is nested some other way, change the $recordXpath value in the script to the correct XPath for finding the <record> elements. There can be multiple <record> elements in a single file as well as multiple XML files in the data directory. Here's an example XML file:

    <?xml version="1.0" encoding="UTF-8"?>
    <records>
      <record>
        <header>
          <identifier>oai:some.host.edu:id-1234</identifier>
          <setSpec>foo</setSpec>
        </header>
        <metadata> [ ... ]
        </metadata>
      </record>
      <record> [ ... ] </record>
    </records>
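
To make the file format concrete, the sketch below reads a file of this shape and extracts what a loader needs from each record: the short id after the last ":", the setSpec values, and the serialized <metadata> element. It illustrates the format only and is not a port of LoadOai.pl. The function name read_records and the record_path parameter (a rough analogue of $recordXpath, limited to the XPath subset ElementTree supports) are assumptions.

```python
import xml.etree.ElementTree as ET


def read_records(xml_text, record_path="record"):
    """Return a list of (short_id, set_specs, metadata_xml) tuples."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    records = []
    for record in root.findall(record_path):
        identifier = record.findtext("header/identifier", default="")
        # Keep only the part after the last ":" (oai:host:id -> id)
        short_id = identifier.rsplit(":", 1)[-1]
        set_specs = [s.text for s in record.findall("header/setSpec")]
        metadata = record.find("metadata")
        metadata_xml = (
            ET.tostring(metadata, encoding="unicode").strip()
            if metadata is not None
            else None
        )
        records.append((short_id, set_specs, metadata_xml))
    return records


# Small stand-in for a data file; the <title> content is a placeholder
SAMPLE_FILE = """<records>
  <record>
    <header>
      <identifier>oai:some.host.edu:id-1234</identifier>
      <setSpec>foo</setSpec>
    </header>
    <metadata><title>Example</title></metadata>
  </record>
</records>"""

if __name__ == "__main__":
    for rec in read_records(SAMPLE_FILE):
        print(rec)
```

Running a check like this over the data directory before loading can catch files that are not well-formed or records that lack an identifier.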

Record Loader (PHP)

Lisa Brooks at IssueLab has created an open source PHP script for loading records into a database for the UMProvider. This can be used as an alternative to the Perl script (LoadOai.pl) described above.

The script along with detailed instruction can be found here: http://harvest.issuelab.org/software.php
