DeletedRecords
Deleted Records
Protocol Definition
The OAI Protocol states that repositories must declare one of three levels of support for deleted records in the deletedRecord element of the Identify response:
- no [<deletedRecord>no</deletedRecord>] - the repository does not maintain information about deletions. A repository that indicates this level of support must not reveal a deleted status in any response.
- persistent [<deletedRecord>persistent</deletedRecord>] - the repository maintains information about deletions with no time limit. A repository that indicates this level of support must persistently keep track of the full history of deletions and consistently reveal the status of a deleted record over time.
- transient [<deletedRecord>transient</deletedRecord>] - the repository does not guarantee that a list of deletions is maintained persistently or consistently. A repository that indicates this level of support may reveal a deleted status for records.
See Identify Response Example 1 for a full example of an Identify response with persistent support of deleted records.
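A harvester can read the declared level from the Identify response before deciding how to harvest. The following is a minimal sketch using only the Python standard library. The sample XML is an abbreviated, illustrative fragment (most required Identify elements are omitted), and the function name `deleted_record_policy` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated, illustrative Identify response (not a complete, valid one).
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <deletedRecord>persistent</deletedRecord>
  </Identify>
</OAI-PMH>"""


def deleted_record_policy(identify_xml):
    """Return the declared deletedRecord level: 'no', 'persistent' or 'transient'."""
    root = ET.fromstring(identify_xml)
    element = root.find("oai:Identify/oai:deletedRecord", OAI_NS)
    if element is None or element.text is None:
        raise ValueError("Identify response has no deletedRecord element")
    level = element.text.strip()
    if level not in ("no", "persistent", "transient"):
        raise ValueError(f"unexpected deletedRecord value: {level!r}")
    return level


print(deleted_record_policy(SAMPLE_IDENTIFY))  # prints "persistent"
```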
If there is support for deleted records (whether persistent or transient) and an OAI record is deleted, the datestamp must be the date and time that the record was deleted, the OAI header must contain the attribute status="deleted" (i.e. <header status="deleted">), and the record must not include metadata or about containers. The OAI header might look like this:
<header status="deleted">
  <identifier>oai:arXiv.org:hep-th/9801010</identifier>
  <datestamp>1999-02-23</datestamp>
  <setSpec>physic:hep</setSpec>
  <setSpec>math</setSpec>
</header>
See Deleted Record Example 1 in XML for a full example of a ListRecords response containing deleted records.
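On the harvesting side, the deleted status can be read from the status attribute of the header. The sketch below parses the header fragment shown above with the Python standard library. It assumes the fragment is used as is, without a namespace; in a full protocol response the elements sit in the OAI-PMH namespace and the lookups would need to account for that. The function name `parse_header` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The header fragment from the text, without the OAI-PMH namespace.
HEADER_XML = """<header status="deleted">
  <identifier>oai:arXiv.org:hep-th/9801010</identifier>
  <datestamp>1999-02-23</datestamp>
  <setSpec>physic:hep</setSpec>
  <setSpec>math</setSpec>
</header>"""


def parse_header(header_xml):
    """Extract identifier, datestamp, sets and deleted flag from a header."""
    header = ET.fromstring(header_xml)
    return {
        "identifier": header.findtext("identifier"),
        "datestamp": header.findtext("datestamp"),
        "set_specs": [s.text for s in header.findall("setSpec")],
        # A record is deleted only if the header carries status="deleted".
        "deleted": header.get("status") == "deleted",
    }


info = parse_header(HEADER_XML)
print(info["identifier"], info["deleted"])
```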
Note that the deleted status is a property of an OAI record (OAI record = Unique ID of item + metadataPrefix + datestamp). If, for example, a repository provides records in oai_dc, oai_marc, and marc21 and decides to delete the records in oai_marc, it can mark these records deleted without affecting the records in oai_dc and marc21. But if the item (with the unique id) is deleted, the records disseminated in all three metadata formats should be marked deleted.
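One way a data provider might model this distinction is to key its records by identifier and metadataPrefix, so that a single format can be marked deleted independently of the others. This is only an illustrative sketch; the `RecordStore` class, its methods, and the identifier `oai:example.org:1` are all hypothetical, and datestamps are kept as plain strings for brevity.

```python
class RecordStore:
    """Toy in-memory store keyed by (identifier, metadataPrefix)."""

    def __init__(self):
        # (identifier, metadataPrefix) -> {"datestamp": str, "deleted": bool}
        self.records = {}

    def put(self, identifier, prefix, datestamp):
        self.records[(identifier, prefix)] = {"datestamp": datestamp, "deleted": False}

    def delete_record(self, identifier, prefix, datestamp):
        """Mark one record (one metadata format of an item) as deleted."""
        entry = self.records[(identifier, prefix)]
        entry["deleted"] = True
        entry["datestamp"] = datestamp  # datestamp becomes the deletion date

    def delete_item(self, identifier, datestamp):
        """Mark the records in every format of an item as deleted."""
        for (item_id, prefix), entry in self.records.items():
            # Records deleted earlier keep their original deletion datestamp.
            if item_id == identifier and not entry["deleted"]:
                self.delete_record(item_id, prefix, datestamp)


store = RecordStore()
for prefix in ("oai_dc", "oai_marc", "marc21"):
    store.put("oai:example.org:1", prefix, "2005-01-01")

# Withdraw only the oai_marc record; oai_dc and marc21 are unaffected.
store.delete_record("oai:example.org:1", "oai_marc", "2005-06-01")
```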
See http://www.openarchives.org/OAI/openarchivesprotocol.html#DeletedRecords for the protocol documentation on deleted records.
As of October 2005, 55% (411) of 2.0-compliant sites did not support deleted records, 34% (256) supported persistent information about deleted records, and 11% (86) supported transient information about deleted records (from the Experimental OAI Registry at UIUC).
Best Practices for Deleted Records
Best practice is to support persistent information about deleted records. This allows service providers to track which records have been deleted and purge the appropriate records from their service. Maintaining persistent information about deleted records has the added benefit of helping to ensure that a repository does not reuse OAI identifiers, because each identifier continues to be used for its deleted record.
If maintenance of persistent information about deleted records is not possible, OAI repositories should consistently maintain transient information about deleted records. Transient information should be maintained for a minimum of six months to allow service providers who harvest sites irregularly to harvest the information about deleted records.
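For a repository that keeps only transient deletion information, the six-month minimum could be enforced when old deletion markers are cleaned up. The sketch below is one possible approach, not a prescribed one. It approximates six months as 183 days, which is an assumption, and the function name `purgeable_tombstones` and the sample identifiers are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "six months" is approximated as 183 days.
MIN_RETENTION = timedelta(days=183)


def purgeable_tombstones(tombstones, now, retention=MIN_RETENTION):
    """Return identifiers whose deletion markers are older than the retention period.

    tombstones maps an OAI identifier to the datetime at which it was deleted.
    """
    return sorted(
        identifier
        for identifier, deleted_at in tombstones.items()
        if now - deleted_at > retention
    )


tombstones = {
    "oai:example.org:1": datetime(2005, 5, 1, tzinfo=timezone.utc),
    "oai:example.org:2": datetime(2005, 10, 1, tzinfo=timezone.utc),
}
now = datetime(2006, 1, 1, tzinfo=timezone.utc)
print(purgeable_tombstones(tombstones, now))  # prints ['oai:example.org:1']
```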
Significance of Support for Deleted Records
The OAI Protocol is designed to support incremental harvesting by service providers. This means that service providers are able to harvest only those records which have been added, modified, or deleted since the last harvest of the repository. In order to determine whether a record has been added, modified, or deleted since the last harvest, the protocol relies on a change in the datestamp (see the Datestamps section). See Selective Harvesting in the OAI protocol documentation for more information about this concept.
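As an illustration, an incremental harvest is typically expressed as a ListRecords request restricted by the from (and optionally until) argument. The sketch below only builds the request URL; the base URL `http://example.org/oai` and the function name `list_records_url` are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for a selective (incremental) harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # datestamp of the previous harvest
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Request everything added, modified, or deleted since 2005-10-01.
print(list_records_url("http://example.org/oai", "oai_dc", from_date="2005-10-01"))
# prints http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2005-10-01
```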
However, if a repository does not support information (whether persistent or transient) about deleted records and does in fact delete records, a service provider conducting an incremental harvest has no way to know that these records have been deleted. This has two specific implications:
- The records which have been deleted from the data provider's repository will still appear within the service provider's end product.
Example: Repository A has decided to withdraw access to 100 digitized photographs because of a copyright dispute. It deletes the metadata items from its database and the records are no longer disseminated via its OAI repository. Repository A does not support information about deleted records. When Service Provider 1 incrementally harvests Repository A for records that have changed since the last harvest, the ListRecords request (with from and until arguments) returns only new records added since the last harvest. Service Provider 1 adds the new records to its service but does not make any other changes. An end user searching Service Provider 1's database finds an interesting metadata record, clicks the URL pointing to the content described by the metadata, but is directed to an error page because that content has been deleted. This reflects badly on both Service Provider 1 and Repository A.
- The service provider will need to conduct regular full harvests in order to ensure that its data matches the repository.
While conducting periodic full harvests is generally good practice for service providers, conducting regular full harvests can present a scalability problem. If a repository does not support information about deleted records, regular full harvests are, unfortunately, essential. Regular re-harvests are the only way to ensure that the records the service provider has match the records the OAI repository is providing.
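When a full re-harvest is the only option, a service provider can infer deletions by comparing the identifiers it already holds with those returned by the full harvest. The following is a minimal sketch of that comparison; the function name `stale_identifiers` and the sample identifiers are hypothetical, and collecting the identifiers from the repository is assumed to happen elsewhere.

```python
def stale_identifiers(local_ids, harvested_ids):
    """Return identifiers held locally that the full harvest no longer returned."""
    return set(local_ids) - set(harvested_ids)


# Identifiers currently in the service provider's database.
local_ids = ["oai:example.org:1", "oai:example.org:2", "oai:example.org:3"]
# Identifiers returned by a fresh full harvest of the repository.
harvested_ids = ["oai:example.org:1", "oai:example.org:3", "oai:example.org:4"]

# Record 2 has disappeared from the repository and can be purged locally.
print(stale_identifiers(local_ids, harvested_ids))  # prints {'oai:example.org:2'}
```

Because every identifier has to be fetched again on each run, the cost of this comparison grows with the size of the repository, which is the scalability concern described above.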
If a repository supports persistent or transient information for deleted records, a service provider will be able to purge the appropriate records from its database or service and maintain greater consistency with the original repository. See also the Datestamps section for more information about the issue of maintaining consistency between service and data providers.
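The purge step might look like the following on the service provider side. This is a sketch under simplifying assumptions: harvested records are assumed to have already been parsed into dictionaries with identifier, deleted, and metadata keys, and the local service is modelled as a plain dictionary. The function name `apply_harvest` is hypothetical.

```python
def apply_harvest(index, harvested):
    """Apply the results of an incremental harvest to a local index.

    index maps identifier -> metadata; harvested is an iterable of dicts
    with the keys "identifier", "deleted" and "metadata".
    """
    for record in harvested:
        if record["deleted"]:
            # Purge the record; ignore deletions of records never harvested.
            index.pop(record["identifier"], None)
        else:
            # New or modified record: add it or overwrite the old copy.
            index[record["identifier"]] = record["metadata"]
    return index


index = {"oai:example.org:1": {"title": "Old photo"}}
harvested = [
    {"identifier": "oai:example.org:1", "deleted": True, "metadata": None},
    {"identifier": "oai:example.org:2", "deleted": False, "metadata": {"title": "New photo"}},
]
apply_harvest(index, harvested)
print(sorted(index))  # prints ['oai:example.org:2']
```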
