Re: "Repositories", OAI-PMH and web crawling

From: raffaele messuti <raffaele.messuti_at_nyob>
Date: Mon, 27 Feb 2012 18:10:41 +0100
To: CODE4LIB_at_LISTSERV.ND.EDU

On Sun, Feb 26, 2012 at 3:42 PM, Godmar Back <godmar_at_gmail.com> wrote:
> May I ask a side question and make a side observation regarding the
> harvesting of full text of the object to which a OAI-PMH record refers?

In Italy, institutional repositories of theses are required to publish
metadata as MPEG-21 DIDL (available out of the box for EPrints or
DSpace) for the harvesting and deposit process run by the national
libraries of Rome and Florence (http://www.depositolegale.it).

Take a look at this example from the IR of the University of Bologna[1]:
-- dii:Identifier is the HTML web page (jump-off page)
-- didl:Component represents each full-text document composing the Item
-- dc:rights uses the info:eu-repo vocabulary[2]
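
A rough sketch of how a harvester might read those three elements out of a
DIDL record. The namespace URIs, the layout of the sample record and the
example.org URLs are my assumptions for illustration, not a copy of the
Bologna record; summarize_didl is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URIs assumed from common MPEG-21 DIDL/DII and Dublin Core usage.
NS = {
    "didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS",
    "dii": "urn:mpeg:mpeg21:2002:01-DII-NS",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample record (hypothetical URLs), not real repository output.
SAMPLE = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS"
    xmlns:dii="urn:mpeg:mpeg21:2002:01-DII-NS"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <didl:Item>
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml">
        <dii:Identifier>http://example.org/4182/</dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Component>
      <didl:Resource mimeType="application/xml">
        <dc:rights>info:eu-repo/semantics/embargoedAccess</dc:rights>
      </didl:Resource>
    </didl:Component>
    <didl:Component>
      <didl:Resource mimeType="application/pdf"
                     ref="http://example.org/4182/1/thesis.pdf"/>
    </didl:Component>
  </didl:Item>
</didl:DIDL>"""


def summarize_didl(xml_text):
    """Collect jump-off page identifiers, full-text files and rights statements."""
    root = ET.fromstring(xml_text)
    identifiers = [
        e.text.strip() for e in root.iterfind(".//dii:Identifier", NS) if e.text
    ]
    files = []
    for component in root.iterfind(".//didl:Component", NS):
        for resource in component.iterfind("didl:Resource", NS):
            # Only resources given by reference point at a downloadable file;
            # resources with inline XML (such as the metadata) carry no ref.
            if resource.get("ref"):
                files.append((resource.get("ref"), resource.get("mimeType")))
    rights = [e.text.strip() for e in root.iterfind(".//dc:rights", NS) if e.text]
    return {"identifiers": identifiers, "files": files, "rights": rights}


if __name__ == "__main__":
    print(summarize_didl(SAMPLE))
```

For a live record you would pass in the body of the GetRecord response
instead of SAMPLE; the element paths may need adjusting to the actual record.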

Full-text documents under access restrictions or embargo may be
harvested only from the designated IP addresses of the crawlers
(Heritrix, wget-warc) located inside the national libraries (this is
also easy to set up with DSpace or EPrints).
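
The idea behind that rule, as a minimal sketch: it is not how DSpace or
EPrints implement it, the function name is hypothetical, and the crawler
addresses are made-up documentation ranges rather than the real ones.

```python
import ipaddress

# Hypothetical crawler networks (reserved documentation address ranges).
ALLOWED_CRAWLERS = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("198.51.100.7/32"),
]

OPEN_ACCESS = "info:eu-repo/semantics/openAccess"


def may_fetch_full_text(client_ip, rights):
    """Open access files go to anyone; everything else only to listed crawlers."""
    if rights == OPEN_ACCESS:
        return True
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_CRAWLERS)
```

In practice the same check is usually expressed in the repository or web
server configuration rather than in application code.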

Here are the guidelines from the DRIVER project[3], which we borrowed.

[1] http://amsdottorato.cib.unibo.it/cgi/oai2?verb=GetRecord&metadataPrefix=didl&identifier=oai:amsdottorato.cib.unibo.it:4182
[2] http://wiki.surffoundation.nl/display/standards/info-eu-repo
[3] https://issue.guidelines.driver.research-infrastructures.eu/wiki/UseOfCompoundObjectWrapping

ciao

--
raffaele - @atomotic