Sets

Best practices for the use of sets are, in many ways, an extremely fluid and evolving area. The section below attempts to distinguish between what is an accepted best practice and what is either still under discussion or dependent on the domain to which a data provider belongs.

Protocol Definition

The OAI protocol defines a set as "an optional construct for grouping items for the purpose of selective harvesting." Set structure may be flat, hierarchical, or overlapping. Items may belong to more than one set. Each set must have a setSpec which is used within OAI requests and a setName which is a human-readable string. Each set may also have a setDescription which is an optional and repeatable container that can hold XML-encoded data about the set.

The OAI protocol specifies that the setSpec is a "colon [:] separated list indicating the path from the root of the set hierarchy to the respective node. Each element in the list is a string consisting of any valid URI unreserved characters, which must not contain any colons [:]. Since a setSpec forms a unique identifier for the set within the repository, it must be unique for each set. Flat set organizations have only sets with setSpec that do not contain any colons [:]."
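The quoted rule can be checked mechanically. The following is a minimal sketch, assuming the unreserved characters are letters, digits, and `- _ . ! ~ * ' ( )`; the function names `parse_setspec` and `is_flat` are illustrative and not part of the protocol.

```python
import re

# Assumption: colon-separated segments built from URI unreserved characters
# (letters, digits, and - _ . ! ~ * ' ( )), as described in the quoted rule.
_SEGMENT = r"[A-Za-z0-9\-_.!~*'()]+"
SETSPEC_RE = re.compile(rf"{_SEGMENT}(?::{_SEGMENT})*")


def parse_setspec(setspec: str) -> list[str]:
    """Return the path from the root of the set hierarchy to this node."""
    if not SETSPEC_RE.fullmatch(setspec):
        raise ValueError(f"not a valid setSpec: {setspec!r}")
    return setspec.split(":")


def is_flat(setspecs: list[str]) -> bool:
    """A flat set organization has no setSpec that contains a colon."""
    return all(":" not in spec for spec in setspecs)
```

For instance, `parse_setspec("Functions:Algebra")` yields the two path elements `Functions` and `Algebra`, while a setSpec containing a space is rejected.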

As stated above, hierarchical sets are indicated to the service provider through a colon separated list in the setSpec.

If sets are implemented in an OAI repository, set membership (the setSpec) must be included in the header of records returned in response to the ListIdentifiers, ListRecords and GetRecord requests.

The OAI protocol explicitly states that data providers may have records in an OAI repository which do not belong to any set. The protocol does not specify how sets should be defined, organized, or arranged within a repository, nor does it make any recommendations about the number of sets or the number of records within a set (for example, a set may contain no records).

If an OAI repository has implemented sets, a service provider may still choose to harvest all records (ListRecords) or identifiers (ListIdentifiers) without reference to the sets. A data provider may not require that its repository be harvested only by set.
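As a rough illustration of the two harvesting styles, the sketch below builds request URLs with and without the `set` argument. The base URL and the helper name `harvest_url` are hypothetical; only the `verb`, `metadataPrefix`, and `set` arguments come from the protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def harvest_url(base_url: str, verb: str = "ListRecords",
                metadata_prefix: str = "oai_dc",
                set_spec: Optional[str] = None) -> str:
    """Build a ListRecords or ListIdentifiers request, with or without a set."""
    if verb not in ("ListRecords", "ListIdentifiers"):
        raise ValueError("expected ListRecords or ListIdentifiers")
    params = {"verb": verb, "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Selective harvesting: restrict the request to one set.
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, used only for illustration.
BASE = "http://example.org/oai"
whole_repository = harvest_url(BASE)
one_set = harvest_url(BASE, "ListIdentifiers", set_spec="Functions:Algebra")
```

Omitting `set_spec` harvests the whole repository, which a data provider must continue to allow even when sets are implemented.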

See the OAI Protocol on Sets and the OAI Implementation Guidelines.

Service Provider Usage of Sets

Service providers can use sets to selectively harvest groups of records that are appropriate for their service. For example, the National Science Digital Library might harvest only those sets that include science, engineering, and mathematics oriented records. The IMLS Digital Content Gateway might harvest only those sets and repositories that include metadata describing material in collections that have been funded by the Institute of Museum and Library Services. Sets can help a service provider avoid harvesting an entire OAI repository, which can reduce the workload on the data provider's server.

Sets and set descriptions have the potential to aid service providers further. The use of set descriptions in particular can provide both technical and descriptive information about the set of metadata records. This information might help a service provider not only determine whether to harvest a set, but also how often to incrementally harvest or re-harvest the records and what metadata formats the records are available in. In addition, the information could potentially be used to provide contextual information about the resources described by the metadata to the end-user.

Service providers can run into a number of obstacles when using sets to selectively harvest. These include:

  • Interpreting how a repository has organized sets and determining which sets to harvest
Issue: The setName is not human understandable and/or no setDescription is provided. Because it is required, the setName is often used to interpret the content of the set and to decide whether its items should be harvested.
Issue: There is a large number of sets to sort through. As of October 6, 2005, the largest number of sets in a single OAI repository was 4806. The average number of sets was 68 and the median number of sets was 5.
  • Knowing when there are records that belong to no sets
Issue: Items that belong to no sets are included in the OAI repository.
  • Knowing when there are empty sets
Issue: The data provider exposes sets with no records. As of August 31, 2004, a report from the OAI Registry at UIUC identified 94 data providers that expose sets with no items. The number of empty sets per data provider ranges from a high of 1048 to a low of 1. The average number of empty sets is 76, and the median is 12.
  • Understanding relationships between sets
Issue: Relationships between sets are not expressed. While there is a mechanism to express relationships between hierarchical sets, there is no mechanism to express relationships between overlapping sets. The only way to know which items belong to multiple non-hierarchical sets is to harvest the identifiers or records, which contain the header information.
  • Knowing how many records there are within a set before harvesting
Issue: The number of records within a set is not expressed, although it can be given via a completeListSize attribute in a resumptionToken or within the set description.
  • Knowing when a set structure has been substantially changed
Issue: Changes in a set structure have not been communicated as suggested in the Managing the Repository Lifecycle section.
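Where a data provider does supply completeListSize, a service provider can read it from the first response before committing to a full harvest. The sketch below assumes a hypothetical ListIdentifiers response (`SAMPLE`) and an illustrative helper name; the attribute is optional, so the function returns `None` when it is absent.

```python
import xml.etree.ElementTree as ET
from typing import Optional

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical first page of a ListIdentifiers response for one set.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2005-10-01</datestamp>
      <setSpec>maps</setSpec>
    </header>
    <resumptionToken completeListSize="1250" cursor="0">token-1</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def complete_list_size(response_xml: str) -> Optional[int]:
    """Return completeListSize from the resumptionToken, if one is given."""
    token = ET.fromstring(response_xml).find(f".//{OAI_NS}resumptionToken")
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None else None
```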

Many of these obstacles point to both a need for data providers to include documentation about sets and a need for data and service providers to communicate with one another around set issues.

Best Practices for Sets

When should sets be used?

The use of sets within an OAI repository is optional, and is simply a way to organize metadata to support selective harvesting by service providers. Because service providers have many different criteria for what metadata they want to harvest and how they store and manipulate metadata once harvested, it is unlikely that all sets will be useful for all service providers. For this reason, if a data provider is interested in inclusion in a specific harvesting service, it may want to communicate with the service provider to understand 1) whether sets would be beneficial for the service provider, and 2) if so, what sort of organization would be most useful.

That said, if the OAI repository contains a large number of items describing resources from a variety of different collections, or if a data provider wishes to distinguish between one group of items and another group, it is a best practice to use the set concept in the OAI protocol to do this.

How should sets be organized?

The protocol does not specify how sets should be defined, organized or arranged within a repository, and current practice varies widely. Sets are organized according to any number of factors: subject, type of material, a traditional library or museum collection, publication status, originating institution, internal departments, and access restrictions. Set organization may be dictated by the program or software used by the data provider as well as by internal workflow and organization.

If a data provider is interested in inclusion in a specific harvesting service, it may want to communicate with that service provider to understand what sort of organization of sets is most useful to the service provider. In addition it may want to understand how other data providers in its domain are organizing their OAI repositories. The Deutsche Initiative für Netzwerkinformation (DINI), for example, has made some recommendations (PDF document) for set organization for the higher education community in Germany which suggest using subject classification, publication type, and/or document type as possible organizing structures.

There is no single best practice for the organization of sets. The most realistic recommendation is that data providers organize sets in a way which best meets the needs of their primary service provider and can be easily done within their own internal workflows.

However, it can be useful to organize the metadata items into sets according to the collections of resources they represent. The concept of collections might vary considerably. However, the internal conception of collections that data providers have is usually helpful to service providers and allows them to selectively harvest. Many times, the concept of collection actually corresponds to a unity of topics, material type and person in charge of the original collection. Most of the time, each collection has its own home page in the data provider's local system, for example the text collections at the University of Michigan. Collections can help provide crucial contextual information to supplement the metadata within the set. See Appropriate Representations of Resources in the Best Practices for Shareable Metadata section for more information about context.

Technical issues around the organization of sets

Relationships between sets

As stated above, set structure can be flat, strictly hierarchical, or overlapping. Items may belong to more than one set; this is typically indicated in the header container of an OAI record. For example:
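A hypothetical record header with two setSpec elements is embedded in the sketch below, together with a small helper that reads the set membership back out. The identifier, datestamp, and set names are invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical header: the item belongs to two sets at once.
HEADER = """<header xmlns="http://www.openarchives.org/OAI/2.0/">
  <identifier>oai:example.org:item-42</identifier>
  <datestamp>2005-10-06</datestamp>
  <setSpec>Functions:Algebra</setSpec>
  <setSpec>science</setSpec>
</header>"""


def set_membership(header_xml: str) -> list[str]:
    """List every setSpec found in a record header, in document order."""
    header = ET.fromstring(header_xml)
    return [el.text.strip() for el in header.findall(f"{OAI_NS}setSpec")]
```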


Independent of what is in the header container of an OAI record, hierarchical sets can also be indicated through the use of colons in the setSpec. For example, <setSpec>Functions:Algebra</setSpec> indicates that there is a <setSpec>Functions</setSpec> set that includes <setSpec>Functions:Algebra</setSpec>, which can be harvested separately from <setSpec>Functions</setSpec>. However, not all data providers follow this convention. If sets are hierarchical, it is a best practice to use the : convention indicated in the OAI protocol. If there are hierarchical sets and the : convention is not used, it is a best practice to document the relationship between sets in the set description (discussed further below).
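A service provider can derive the implied hierarchy from the setSpec values alone. The following sketch uses illustrative function names; it lists the ancestors of a setSpec and reports ancestors that a repository's set list does not itself include, which may be a sign that the colon convention is being used inconsistently.

```python
def ancestors(setspec: str) -> list[str]:
    """Return the enclosing sets implied by the colon convention, root first."""
    parts = setspec.split(":")
    return [":".join(parts[:i]) for i in range(1, len(parts))]


def missing_parents(setspecs: list[str]) -> list[str]:
    """Return implied parent sets that do not appear in the given list."""
    known = set(setspecs)
    return sorted({a for spec in setspecs for a in ancestors(spec) if a not in known})
```

For example, a list containing only `Functions:Algebra` and `maps` implies a parent set `Functions` that the list does not include.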

Sets may also be overlapping, but not strictly hierarchical. For example, the University of Michigan repository has organized its sets by collection (all records within a set describe resources belonging to a specific collection), and has provided specific sets for the use of specific service providers, e.g., it also provides a set corresponding to all scientific material within its collections. There is not a formalized way to indicate overlapping sets within the protocol as there is with hierarchical sets. It is a best practice to document overlapping sets in the set description (discussed further below).
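Because the protocol offers no formal signal for overlap, a service provider can only discover it from harvested headers. The sketch below assumes the headers have already been reduced to a mapping from item identifier to setSpec list (the identifiers and set names in the usage data are hypothetical) and counts the items shared by each pair of sets.

```python
from collections import Counter
from itertools import combinations


def overlapping_sets(memberships: dict[str, list[str]]) -> Counter:
    """Count, for each pair of sets, the items that belong to both."""
    pairs = Counter()
    for setspecs in memberships.values():
        # Sort so that each pair is always counted under the same key.
        for a, b in combinations(sorted(set(setspecs)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical harvested headers: identifier -> setSpec values.
harvested = {
    "oai:example.org:1": ["coll-a", "science"],
    "oai:example.org:2": ["coll-a"],
    "oai:example.org:3": ["science", "coll-b"],
}
overlaps = overlapping_sets(harvested)
```

Note that this counts every pair, including a hierarchical parent and child if both appear in a header, so the result would need filtering to isolate non-hierarchical overlap.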

Empty sets

Whenever possible, data providers should not expose empty sets. In some cases exposure of empty sets is a function of the data provider software in use, but whenever possible these should be suppressed in the actual OAI repository.
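On the data provider side, suppression can be as simple as filtering the set list before it is exposed. This is a minimal sketch, assuming the repository can produce a per-set item count (including items in any descendant sets); the function and argument names are hypothetical.

```python
def sets_to_expose(all_sets: dict[str, str], item_counts: dict[str, int]) -> dict[str, str]:
    """Keep only the sets (setSpec -> setName) that contain at least one item."""
    return {
        spec: name
        for spec, name in all_sets.items()
        if item_counts.get(spec, 0) > 0
    }
```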

Set membership

Within the OAI construct, items belong to sets. (See the first part of OAI Identifiers for a brief discussion of items and records.) The implication of this is that a set may contain items which are disseminated in only one metadata format and other items which are disseminated in two or more metadata formats, depending on the data provider implementation. There is no formal way within the OAI protocol to indicate what metadata formats the items in a set are available in. It is a best practice that all items within a set be available in the same metadata formats. Items which are disseminated in a different combination of metadata formats should be organized into a distinct set. For example, if a data provider generally exposes all items in a set in oai_dc, MARC, and MODS, but has a new batch of items to add that are only available in oai_dc and MODS, these items should be placed into a new set rather than added to the existing one. If this is not possible, the data provider should add explicit documentation to the set description noting the inconsistency in metadata format availability.
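A data provider could audit this before adding a batch of items. The sketch below assumes two hypothetical lookups, one from item to its available metadata prefixes and one from item to its sets, and reports for each set the formats that only some of its items offer.

```python
def partially_available_formats(item_formats: dict[str, set[str]],
                                item_sets: dict[str, list[str]]) -> dict[str, set[str]]:
    """For each set, return the formats offered by some but not all of its items."""
    by_set: dict[str, list[set[str]]] = {}
    for item, setspecs in item_sets.items():
        for spec in setspecs:
            by_set.setdefault(spec, []).append(item_formats[item])
    report = {}
    for spec, format_sets in by_set.items():
        partial = set.union(*format_sets) - set.intersection(*format_sets)
        if partial:
            report[spec] = partial
    return report
```

A set that appears in the report either needs splitting or needs the inconsistency documented in its set description.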

If a repository has implemented sets, it is a best practice for all items to belong to at least one set; that is, there should be no items which don't belong to any set. A general set can be created for items which cannot be categorized, "All uncategorized technical reports", for example. If a data provider does include items which do not belong to any set, it is useful to note this information in the repository description within the Identify response. Service providers may harvest only by set and may never realize that additional records are available unless such information is noted.
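One way a data provider might apply this is sketched below: find the items with no set membership and place them in a general set. The default setSpec `uncategorized` and the function names are hypothetical choices.

```python
def items_without_sets(item_sets: dict[str, list[str]]) -> list[str]:
    """Return identifiers of items that belong to no set."""
    return sorted(item for item, specs in item_sets.items() if not specs)


def assign_general_set(item_sets: dict[str, list[str]],
                       general_spec: str = "uncategorized") -> dict[str, list[str]]:
    """Return a copy in which every set-less item belongs to the general set."""
    return {
        item: (list(specs) if specs else [general_spec])
        for item, specs in item_sets.items()
    }
```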

How should a data provider describe sets?


It is best practice that the setName should not only be human readable, but human understandable. setNames such as MWB01 or c72 may have some meaning to a data provider, but they will not to the setName's primary audience, the service provider. Service providers rely on the setName to interpret the set particularly if the setSpec is not human readable/understandable or if there is no set description.


If a repository has sets implemented, it is a best practice to include a setDescription. This allows service providers to understand what items are in a set and how it is organized. It also allows registries such as the UIUC registry to enable better keyword searching from responses to OAI requests. However, what precisely should be included within the setDescription container is still an open issue.
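A data provider or registry can check a ListSets response for sets that lack a description. The response below is a hypothetical, trimmed example, and the helper name is illustrative.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, trimmed ListSets response with one undescribed set.
LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>c72</setSpec>
      <setName>c72</setName>
    </set>
    <set>
      <setSpec>cookbooks</setSpec>
      <setName>Historic American cookbooks</setName>
      <setDescription>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:description>Records for digitized cookbooks.</dc:description>
        </oai_dc:dc>
      </setDescription>
    </set>
  </ListSets>
</OAI-PMH>"""


def sets_without_description(list_sets_xml: str) -> list[str]:
    """Return the setSpec of every set that has no setDescription container."""
    root = ET.fromstring(list_sets_xml)
    missing = []
    for set_el in root.iter(f"{OAI_NS}set"):
        if set_el.find(f"{OAI_NS}setDescription") is None:
            missing.append(set_el.findtext(f"{OAI_NS}setSpec"))
    return missing
```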

As of October 1, 2005, 109 (14%) of 757 OAI data providers using the 2.0 version of the protocol included a setDescription container (see for a current list). All but one use Dublin Core for the set description. Below is a rough breakdown of how set description is used:

  • A link to the top page of the journal issue containing the records in the set (dc:description element only): 49 (45%)
  • Description of the resources described by the metadata contained in the set (dc:description element only): 45 (41%)
  • <setDescription> included but is either empty or contains the letter 'c': 9 (8%)
  • Title or publisher of the collection of resources described by the metadata contained in the set and a URL to the collection homepage: 3 (3%)
  • Description of the resources described by the metadata contained in the set (more than dc:description): 2 (2%)
  • Record count only: 1 (1%)

Note that the majority of those that use the setDescription container describe the collection of resources represented by the metadata contained in the set, but do not strictly describe the set itself as a collection of metadata records. Of the 109 data providers who use the setDescription container, only 5 (5%) include descriptive information that contains technical information about the set itself (e.g. metadata formats available). The distinction between the description of the resources represented by the metadata in the set and the description of the set itself as a collection of metadata records and as a technical mechanism in and of itself is an important one. Service providers often need both types of information; knowing the type of resources that are represented by the metadata in the set is important for selection purposes, while knowing what metadata formats are available within a specific set and how often it is updated is very useful to production harvesting activities, particularly if that information is encoded in a machine readable format.

The Best Practices working group does recommend making a distinction between set description (as in the collection of metadata records themselves and the set's relationship to other sets) and collection description (as in the collection of resources represented by the metadata records in the set). However, placement of set description information, and how it is associated with collection description of the resources is not by any means resolved. A discussion paper developed at UIUC (in consultation with the working group but not strictly under its aegis) proposes a method for handling set description, but this is still under discussion and revision.

The following is useful information to include in a setDescription container:

  • The formats in which the data can be harvested. For a repository that offers sets using different metadata formats, the data provider should indicate which metadata format(s) the records in the set adhere to. It is important for service providers interested in harvesting more than unqualified Dublin Core to know which formats are available for each set. As stated above, there is no formal way in the protocol to do this.
  • Accrual periodicity (the frequency with which items are added to a collection) can help a service provider schedule harvests.
  • The relation to other sets. If sets overlap, the data provider could indicate how the sets interact by using <dcterm:isPartOf> in setDescription.
  • An approximate number of records contained in the set. This can also be encoded in the resumptionToken.
  • If the metadata within the set represents a collection (however defined) a collection description is very important for service providers for selection purposes. For example, a service provider may only want to harvest metadata records focused on American history; it may be possible to use collection description to determine which sets to harvest.

In all cases, standard encoding schemes and/or controlled vocabularies should be used if possible in order to facilitate machine processing of the information.
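The items listed above could be assembled into a machine-readable description along the following lines. This is only a sketch: since no description schema has been settled, the choice of dc:format, dcterms:accrualPeriodicity, dcterms:isPartOf, dcterms:extent, and dcterms:abstract is an assumption, as are the function name and the example values.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"


def build_set_description(formats, periodicity, part_of, record_count, summary):
    """Assemble a setDescription element; the element choices are assumptions."""
    desc = ET.Element(f"{{{OAI}}}setDescription")
    for prefix in formats:
        # Metadata formats in which the records of this set can be harvested.
        ET.SubElement(desc, f"{{{DC}}}format").text = prefix
    ET.SubElement(desc, f"{{{DCTERMS}}}accrualPeriodicity").text = periodicity
    for spec in part_of:
        # Relation to other sets, given here by setSpec.
        ET.SubElement(desc, f"{{{DCTERMS}}}isPartOf").text = spec
    ET.SubElement(desc, f"{{{DCTERMS}}}extent").text = f"{record_count} records"
    ET.SubElement(desc, f"{{{DCTERMS}}}abstract").text = summary
    return desc


# Hypothetical values, loosely modelled on a closed set of about 200 records.
example = build_set_description(
    ["oai_dc", "mods"], "Irregular", ["lcbooks"], 200,
    "Records describing digitized books about California history.",
)
xml_text = ET.tostring(example, encoding="unicode")
```

Controlled values (for instance, a vocabulary term for the accrual periodicity) would make the result easier for harvesters to process than free text.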

Examples

The following are two examples of set descriptions.

A set description may be a simple <dc:description> field such as this one in the Library of Congress American Memory data provider. This is embedded in a much more extensive description of the resources represented by the metadata in the set:

    <oai_dc:dc xsi:schemaLocation="">     
    <dc:title xml:lang="en">Records for California As I Saw It (books)</dc:title>     
    <dc:creator>Library of Congress</dc:creator>     
    <dc:description>Transcribed text with illustrations of 186 works documenting the 
    formative era of California's history through eyewitness accounts. The collection 
    covers the dramatic decades between the Gold Rush and the turn of the twentieth century. 
    It captures the pioneer experience; encounters between Anglo-Americans and the diverse 
    peoples who had preceded them; the transformation of the land by mining, ranching, 
    agriculture, and urban development; the often-turbulent growth of communities and cities; 
    and California's emergence as both a state and a place of uniquely American dreams.</dc:description>     
    <dc:description>Set characteristics for calbkbib:  Source records are MARC 
    (from LC catalog); MODS or oai_dc records are dynamically generated using generic 
    transformation when harvested. dct:accrualPolicy: Closed. Contains about 200 records. 
    Records in set calbkbib are also in set lcbooks.</dc:description>     
    <dc:type xml:lang="en">text</dc:type>     
    <dc:type xml:lang="en">books</dc:type>     
    <dc:type xml:lang="en">printed materials</dc:type>     
    <dc:type xml:lang="en">collection</dc:type>     
    <dc:coverage xml:lang="en">1849-1900</dc:coverage>     
    <dc:coverage xml:lang="en">California</dc:coverage>     
    <dc:subject xml:lang="en">Ethnic groups--California.</dc:subject>     
    <dc:subject xml:lang="en">Law and politics--California.</dc:subject>     
    <dc:subject xml:lang="en">California--History.</dc:subject>     
    <dc:subject xml:lang="en">California--Biography.</dc:subject>     
    <dc:subject xml:lang="en">California--Gold discoveries.</dc:subject>     
    <dc:contributor xml:lang="en">Library of Congress, General Collections</dc:contributor>     
    <dc:rights xml:lang="en"></dc:rights>     
    <dc:relation xml:lang="en"></dc:relation>
    </oai_dc:dc>

Alternatively, a <setDescription> field may be employed, as in this sample record from the Michigan State University repository (note: no particular collection description schema has been named here since this is still evolving; see the Dublin Core Collection Description App Profile for the closest to canonical example):

    <dc:title>Feeding America: The Historic American Cookbook Project</dc:title>     
    <dcterm:extent>77 records</dcterm:extent>     
    <dcterm:accessRights>No restrictions</dcterm:accessRights>
    <dc:rights>No restrictions</dc:rights>     
    <dcterm:abstract>Records for cookbooks digitized as part of the Feeding America: 
    Historic American Cookbooks Project</dcterm:abstract>     
    <dc:description>Simple Dublin Core records created from documentation for each cookbook; 
    oai_dc subject uses LCSH; oai_dc:type uses DCMIType vocabulary; oai_dc:coverage (1) uses TGN</dc:description>     
    <dc:language xsi:type="dct:ISO639-2">eng</dc:language>     
    <gen:isLocatedAt>Michigan State University Libraries Digital &amp; Multimedia Center</gen:isLocatedAt>     
    <gen:isAccessedVia xsi:type="dct:URI"> </gen:isAccessedVia>     
    <dc:title>Feeding America: The Historic American Cookbook Project</dc:title>     
    <dcterm:abstract>Online collection of some of the most important and influential American cookbooks 
    from the late 18th to early 20th century.</dcterm:abstract>     
    <dcterm:extent>77 items</dcterm:extent>     
    <dc:language xsi:type="ISO639-2">eng</dc:language>     
    <dc:type xsi:type="cldtype">Collection of Texts</dc:type>     
    <dcterm:accessRights>No restrictions</dcterm:accessRights>     
    <dcterm:accrualPolicy xsi:type="DCCDAccrualPeriodicity">Passive</dcterm:accrualPolicy>     
    <dc:subject xsi:type="LCSH">Cookery, American</dc:subject>     
    <dc:subject xsi:type="LCSH">Cookery -- United States -- 19th century</dc:subject>     
    <dc:subject xsi:type="LCSH">Cookery -- United States -- 20th century</dc:subject>     
    <dc:creator>Michigan State University Libraries. Digital &amp; Multimedia Center.</dc:creator>     
    <gen:isLocatedAt>Michigan State University, 100 Library, East Lansing, MI, 48224, USA</gen:isLocatedAt>