The DSpace platform supports the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) version 2.0 as a data provider. This is accomplished using the OAICat framework from OCLC.
The DSpace build process builds a Web application archive ([dspace-source]/build/dspace-oai.war), in much the same way as the Web UI build process described above. The only differences are that the JSPs are not included, and [dspace-source]/etc/oai-web.xml is used as the deployment descriptor. This 'webapp' is deployed to receive and respond to OAI-PMH requests via HTTP. Note that typically it should not be deployed on SSL (https: protocol). In a typical configuration, this is deployed at dspace-oai, for example:
http://dspace.myu.edu/dspace-oai/request?verb=Identify
The 'base URL' of this DSpace deployment would be:
http://dspace.myu.edu/dspace-oai/request
It is this URL that should be registered with www.openarchives.org. Note that you can easily change the 'request' portion of the URL by editing [dspace-source]/etc/oai-web.xml and rebuilding and deploying dspace-oai.war.
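As a rough illustration of how a harvester might form requests against the base URL above, the following Python sketch appends the OAI-PMH verb and arguments as a query string. The helper name build_request_url is hypothetical and is not part of DSpace or OAICat; the set value reuses the example setSpec form described later in this section.

```python
from urllib.parse import urlencode

# Example base URL taken from the text above
BASE_URL = "http://dspace.myu.edu/dspace-oai/request"


def build_request_url(base_url, verb, **params):
    """Return an OAI-PMH GET URL for the given verb and optional arguments."""
    query = {"verb": verb}
    # Leave out any argument that was not supplied (None)
    query.update({key: value for key, value in params.items() if value is not None})
    return base_url + "?" + urlencode(query)


print(build_request_url(BASE_URL, "Identify"))
print(build_request_url(BASE_URL, "ListRecords",
                        metadataPrefix="oai_dc", set="hdl_1721.1_1234"))
```

Because from is a reserved word in Python, a 'from' argument would have to be passed as build_request_url(BASE_URL, "ListRecords", **{"from": "2003-01-01"}).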
DSpace provides implementations of the OAICat interfaces AbstractCatalog, RecordFactory and Crosswalk that interface with the DSpace content management API and harvesting API (in the search subsystem).
Only the basic oai_dc unqualified Dublin Core metadata set is exported at present; this is particularly easy since all items have qualified Dublin Core metadata. When this metadata is harvested, the qualifiers are simply stripped; for example, description.abstract is exposed as unqualified description. The description.provenance field is hidden, as this contains private information about the submitter and workflow reviewers of the item, including their e-mail addresses. Additionally, to keep in line with OAI community practices, values of contributor.author are exposed as creator values.
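The mapping rules described above can be sketched in a few lines of Python. This is only an illustration of the three rules (strip qualifiers, hide description.provenance, expose contributor.author as creator); the function names are hypothetical and this is not the code of the DSpace crosswalk itself.

```python
def to_oai_dc(element, qualifier=None):
    """Map a qualified DC field to an unqualified oai_dc element name.

    Returns None when the field should not be exposed at all.
    """
    if (element, qualifier) == ("description", "provenance"):
        return None  # hidden: contains private submitter/reviewer details
    if (element, qualifier) == ("contributor", "author"):
        return "creator"  # OAI community practice
    return element  # otherwise the qualifier is simply dropped


def crosswalk(fields):
    """Convert (element, qualifier, value) triples to (oai_dc_element, value) pairs."""
    exposed = []
    for element, qualifier, value in fields:
        name = to_oai_dc(element, qualifier)
        if name is not None:
            exposed.append((name, value))
    return exposed


item_metadata = [
    ("title", None, "An Example Item"),
    ("description", "abstract", "A short abstract."),
    ("description", "provenance", "Submitted by someone@myu.edu"),
    ("contributor", "author", "Smith, J."),
]
print(crosswalk(item_metadata))
```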
Adding support for other metadata sets is simply a matter of creating another Crosswalk implementation and adding it to the oaicat.properties file described below.
Note that the current simple DC implementation (org.dspace.app.oai.OAIDCCrosswalk) does not strip out any invalid XML characters that may be lying around in the data. If your database contains a DC value with, for example, some ASCII control codes (form feed etc.), this may cause problems for OAI harvesters. This should rarely occur, however. XML special characters (such as >) are encoded as entities (e.g. to &gt;).
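For sites that do run into control characters in their data, one possible cleanup step is sketched below in Python. It is an assumption about how such filtering could be done, not something the DSpace crosswalk performs: it removes the C0 control characters that XML 1.0 disallows (everything below U+0020 except tab, line feed and carriage return) plus U+FFFE and U+FFFF, and then escapes the special characters. The function name clean_dc_value is hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Control characters not permitted in XML 1.0 documents, plus U+FFFE/U+FFFF.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are allowed and kept.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_dc_value(value):
    """Drop characters that are invalid in XML 1.0, then escape &, < and >."""
    stripped = _INVALID_XML_CHARS.sub("", value)
    return escape(stripped)


# A value containing a form feed (\x0c) and a '>' character
print(clean_dc_value("x \x0c> y"))
```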
In addition to the implementations of the OAICat interfaces, there are two configuration files relevant to OAI support:
oaicat.properties
This resides as a template in [dspace]/config/templates, and the live version is written to [dspace]/config. You probably won't need to edit this; the install-configs script fills out the relevant deployment-specific parameters. You might want to change the earliestDatestamp field to accurately reflect the oldest datestamp in the system. (Note that this is the value of the last_modified column in the Item database table.)
oai-web.xml
This standard Java Servlet 'deployment descriptor' is stored in the source as [dspace-source]/etc/oai-web.xml, and is written to /dspace/oai/WEB-INF/web.xml.
OAI-PMH allows repositories to expose a hierarchy of sets in which records may be placed. A record can be in zero or more sets.
DSpace exposes collections as sets. The organization of communities is likely to change over time, and is therefore a less stable basis for selective harvesting.
Each collection has a corresponding OAI set, discoverable by harvesters via the ListSets verb. The setSpec is the Handle of the collection, with the ':' and '/' converted to underscores so that the Handle is a legal setSpec, for example:
hdl_1721.1_1234
Naturally enough, the collection name is also the name of the corresponding set.
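The Handle-to-setSpec conversion described above amounts to two character substitutions. A minimal Python sketch, with a hypothetical function name:

```python
def handle_to_set_spec(handle):
    """Convert a collection Handle such as 'hdl:1721.1/1234' to a legal setSpec."""
    # ':' and '/' are not allowed in a setSpec, so both become underscores
    return handle.replace(":", "_").replace("/", "_")


print(handle_to_set_spec("hdl:1721.1/1234"))
```

Note that the conversion is not reversible in general from the string alone, since an underscore in the setSpec could stand for either ':' or '/'.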
Every item in an OAI-PMH data repository must have a unique identifier, which must conform to the URI syntax. As of DSpace 1.2, Handles are not used; this is because in OAI-PMH, the OAI identifier identifies the metadata record associated with the resource. The resource is the DSpace item, whose resource identifier is the Handle. In practical terms, using the Handle for the OAI identifier may cause problems in the future if DSpace instances share items with the same Handles; the OAI metadata record identifiers should be different, as the different DSpace instances would need to be harvested separately and may have different metadata for the item.
The OAI identifiers that DSpace uses are of the form:
oai:host name:handle
For example:
oai:dspace.myu.edu:123456789/345
If you wish to use a different scheme, this can easily be changed by editing the value of OAI_ID_PREFIX at the top of the org.dspace.app.oai.DSpaceOAICatalog class. (You do not need to change the code if the above scheme works for you; the code picks up the host name and Handles automatically from the DSpace configuration.)
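To make the identifier scheme concrete, the following Python sketch builds and splits identifiers of the form oai:host name:handle. The function names are hypothetical and the code only mirrors the format shown above; it does not reproduce the DSpaceOAICatalog implementation.

```python
def make_oai_identifier(host_name, handle):
    """Build an OAI identifier of the form 'oai:<host name>:<handle>'."""
    return f"oai:{host_name}:{handle}"


def parse_oai_identifier(identifier):
    """Split an OAI identifier into (host_name, handle)."""
    # Split on the first two colons only, so the handle is returned intact
    scheme, host_name, handle = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    return host_name, handle


print(make_oai_identifier("dspace.myu.edu", "123456789/345"))
print(parse_oai_identifier("oai:dspace.myu.edu:123456789/345"))
```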
OAI provides no authentication/authorisation details, although these could be implemented using standard HTTP methods. It is assumed that all access will be anonymous for the time being.
A question is, "is all metadata public?" Presently the answer to this is yes; all metadata is exposed via OAI-PMH, even if the item has restricted access policies. The reasoning behind this is that people who do actually have permission to read a restricted item should still be able to use OAI-based services to discover the content.
If, in the future, this 'expose all metadata' approach proves unsatisfactory for any reason, it should be possible to expose only publicly readable metadata. The authorisation system has separate permissions for READing an item and READing the content (bitstreams) within it. This means the system can differentiate between an item with public metadata and hidden content, and an item with hidden metadata as well as hidden content. In this case the OAI data repository should expose only those items with anonymous READ access, so it can hide the existence of records from the outside world completely. In this scenario, one should be wary of protected items that are made public after a time. When this happens, the items are "new" from the OAI-PMH perspective.
OAI-PMH harvesters need to know when a record has been created, changed or deleted. DSpace keeps track of a 'last modified' date for each item in the system, and this date is used for the OAI-PMH date stamp. This means that any changes to the metadata (e.g. admins correcting a field, or a withdrawal) will be exposed to harvesters.
As part of each record given out to a harvester, there is an optional, repeatable "about" section which can be filled out in any (XML-schema conformant) way. Common uses are for provenance and rights information, and there are schemas in use by OAI communities for this. Presently DSpace does not provide any of this information.
DSpace keeps track of deletions (withdrawals). These are exposed via OAI, which has a specific mechanism for dealing with this. Since DSpace keeps a permanent record of withdrawn items, in the OAI-PMH sense DSpace supports deletions 'persistently'. This is as opposed to 'transient' deletion support, which would mean that deleted records are forgotten after a time.
Once an item has been withdrawn, OAI-PMH harvests of the date range in which the withdrawal occurred will find the 'deleted' record header. Harvests of a date range prior to the withdrawal will not find the record, despite the fact that the record did exist at that time.
As an example of this, consider an item that was created on 2002-05-02 and withdrawn on 2002-10-06. A request to harvest the month 2002-10 will yield the 'record deleted' header. However, a harvest of the month 2002-05 will not yield the original record.
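The behaviour in this example follows from the fact that only a single 'last modified' date is kept per item. The Python sketch below models that under a simplifying assumption: each item is a dictionary with a last_modified date and a withdrawn flag, and a harvest simply selects items whose date falls in the requested range. The data layout and function name are hypothetical.

```python
from datetime import date


def harvest(items, from_date, until_date):
    """Return (identifier, status) pairs for items modified within the date range."""
    results = []
    for item in items:
        if from_date <= item["last_modified"] <= until_date:
            status = "deleted" if item["withdrawn"] else "active"
            results.append((item["id"], status))
    return results


# Created on 2002-05-02, withdrawn on 2002-10-06: the withdrawal
# overwrote the item's single 'last modified' date.
item = {
    "id": "oai:dspace.myu.edu:123456789/345",
    "last_modified": date(2002, 10, 6),
    "withdrawn": True,
}

print(harvest([item], date(2002, 10, 1), date(2002, 10, 31)))  # deleted header
print(harvest([item], date(2002, 5, 1), date(2002, 5, 31)))    # nothing found
```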
Note that presently, the deletion of 'expunged' items is not exposed through OAI.
An OAI data provider can prevent any performance impact caused by harvesting by forcing a harvester to receive data in time-separated chunks. If the data provider receives a request for a lot of data, it can send part of the data with a resumption token. The harvester can then return later with the resumption token and continue.
DSpace supports resumption tokens for 'ListRecords' OAI-PMH requests. ListIdentifiers and ListSets requests do not produce a particularly high load on the system, so resumption tokens are not used for those requests.
Each OAI-PMH ListRecords request will return at most 100 records. This limit is set at the top of org.dspace.app.oai.DSpaceOAICatalog.java (MAX_RECORDS). A potential issue here is that if a harvest yields an exact multiple of MAX_RECORDS, the last operation will result in a harvest with no records in it. It is unclear from the OAI-PMH specification if this is acceptable.
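The exact-multiple case can be reproduced with a small model of the paging logic. The Python sketch below assumes that a resumption token is issued whenever a response is completely full, which is consistent with the behaviour described above but is not taken from the DSpaceOAICatalog source; the function names are hypothetical.

```python
MAX_RECORDS = 100  # same limit as described in the text


def list_records(all_records, offset=0):
    """Return one page of records and the offset to resume from (or None)."""
    page = all_records[offset:offset + MAX_RECORDS]
    # Assumed rule: a full page means more records may remain, so resume later
    next_offset = offset + len(page) if len(page) == MAX_RECORDS else None
    return page, next_offset


def harvest_page_sizes(all_records):
    """Follow resumption offsets to the end and report the size of each page."""
    sizes = []
    offset = 0
    while offset is not None:
        page, offset = list_records(all_records, offset)
        sizes.append(len(page))
    return sizes


print(harvest_page_sizes(list(range(250))))  # final page is partially filled
print(harvest_page_sizes(list(range(200))))  # final page is empty
```

Avoiding the empty final response would require the data provider to look one record ahead before deciding whether to issue a token.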
When a resumption token is issued, the optional completeListSize and cursor attributes are not included. OAICat sets the expirationDate of the resumption token to one hour after it was issued, though in fact, since DSpace resumption tokens contain all the information required to continue a request, they do not actually expire.
Resumption tokens contain all the state information required to continue a request. The format is:
from/until/setSpec/offset
from and until are the ISO 8601 dates passed in as part of the original request, and setSpec is also taken from the original request. offset is the number of records that have already been sent to the harvester. For example:
2003-01-01//hdl_1721.1_1234/300
This means the harvest is 'from' 2003-01-01, has no 'until' date, is for collection hdl:1721.1/1234, and 300 records have already been sent to the harvester. (Actually, if the original OAI-PMH request doesn't specify a 'from' or 'until', OAICat fills them out automatically to '0000-00-00T00:00:00Z' and '9999-12-31T23:59:59Z' respectively. This means DSpace resumption tokens will always have from and until dates in them.)
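Since the token is four fields joined by '/', it can be taken apart and rebuilt with simple string operations. The Python sketch below illustrates the from/until/setSpec/offset layout; the class and function names are hypothetical, and the setSpec value uses the form given in the set example earlier in this section. Splitting on '/' is safe here because neither the dates nor a legal setSpec can contain that character.

```python
from typing import NamedTuple


class ResumptionToken(NamedTuple):
    from_date: str   # ISO 8601 date from the original request ('' if absent)
    until_date: str  # ISO 8601 date from the original request ('' if absent)
    set_spec: str    # setSpec from the original request
    offset: int      # number of records already sent to the harvester


def parse_token(token):
    """Split a 'from/until/setSpec/offset' token into its four fields."""
    from_date, until_date, set_spec, offset = token.split("/")
    return ResumptionToken(from_date, until_date, set_spec, int(offset))


def format_token(token):
    """Join the four fields back into the 'from/until/setSpec/offset' form."""
    return "/".join([token.from_date, token.until_date, token.set_spec, str(token.offset)])


parsed = parse_token("2003-01-01//hdl_1721.1_1234/300")
print(parsed)
# Token for the next page, assuming another 100 records were sent
print(format_token(parsed._replace(offset=parsed.offset + 100)))
```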