Re: contentdm and ocred text

From: Krc, Matthew <00000035d8cb5454-dmarc-request_at_nyob>
Date: Mon, 21 Oct 2024 13:42:26 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

NONCONFIDENTIAL // EXTERNAL

Eric,

I don’t think it’s possible through an OAI-PMH request alone. However, given a collection alias and the item’s pointer (its identifying number), you can make a separate request to the CONTENTdm dmGetItemInfo API endpoint; the full text may be available in the ‘transc’ (transcription) field of the response.

Example call for a response in JSON, based on your identifiers below:

https://cdm1224.contentdm.oclc.org/digital/bl/dmwebservices/index.php?q=dmGetItemInfo/p1224coll8/12/json

More info on the dmGetItemInfo endpoint here: https://help.oclc.org/Metadata_Services/CONTENTdm/Advanced_website_customization/API_Reference/CONTENTdm_API/CONTENTdm_Server_API_Functions_dmwebservices#dmGetItemInfo
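
For scripting this across many items, here is a minimal Python sketch that builds the same URL from a host, collection alias, and pointer, then pulls the text out of the response. The function names are hypothetical, the URL pattern is simply copied from the example call above, and the field nickname ‘transc’ is an assumption that may differ from collection to collection. I am also assuming that an empty field may come back as an empty object rather than a string, so the helper only accepts non-empty strings.

```python
import json
from urllib.request import urlopen


def item_info_url(host, alias, pointer, fmt="json"):
    """Build a dmGetItemInfo URL following the pattern of the example call."""
    return (
        f"https://{host}/digital/bl/dmwebservices/index.php"
        f"?q=dmGetItemInfo/{alias}/{pointer}/{fmt}"
    )


def extract_full_text(item_info, field="transc"):
    """Return the text stored in `field`, or None if it is absent or empty.

    Assumption: the full text lives under the 'transc' nickname; other
    collections may use a different field nickname.
    """
    value = item_info.get(field)
    # Assumption: an empty field may be returned as {} instead of a string.
    if isinstance(value, str) and value.strip():
        return value
    return None


def fetch_full_text(host, alias, pointer, field="transc"):
    """Request the item's metadata as JSON and return its full text, if any."""
    with urlopen(item_info_url(host, alias, pointer), timeout=30) as response:
        return extract_full_text(json.load(response), field)
```

Calling fetch_full_text("cdm1224.contentdm.oclc.org", "p1224coll8", 12) would request the example URL above; whether it returns any text depends on how the collection stores its OCR.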

Hope this helps.

-Matt

From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of Eric Lease Morgan <00000107b9c961ae-dmarc-request_at_LISTS.CLIR.ORG>
Date: Thursday, October 17, 2024 at 12:04 PM
To: CODE4LIB_at_LISTS.CLIR.ORG <CODE4LIB_at_LISTS.CLIR.ORG>
Subject: [External] [CODE4LIB] contentdm and ocred text


Given a CONTENTdm item that has been OCRed, is it possible to download the OCRed text, and if so, then what shape does the URL take?

Using OAI-PMH I can list all the records in a CONTENTdm set. Here is an abbreviated, redacted example of a specific record:

<record>
  <header>
    <identifier>oai:cdm1224.contentdm.oclc.org:p1224coll8/12</identifier>
    <datestamp>2015-08-06</datestamp>
    <setSpec>p1224coll8</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/..." >
    <dc:title>'A' Company underground</dc:title>
    <dc:publisher>Co. 'A' Underground</dc:publisher>
    <dc:date>1972</dc:date>
    <dc:language>English</dc:language>
    <dc:coverage>United States</dc:coverage>
    <dc:format>XML</dc:format>
    <dc:rights>Copyright in most of the documents...</dc:rights>
    <dc:source>foo ba</dc:source>
    <dc:type>Text; Image</dc:type>
    <dc:identifier>foobarNewsletter001000</dc:identifier>
    <dc:identifier>http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>

There are three identifiers in the record:

  oai:cdm1224.contentdm.oclc.org:p1224coll8/12
  foobarNewsletter001000
  http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12
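
The first identifier appears to follow the pattern oai:host:alias/pointer. A minimal Python sketch for splitting it into those parts, assuming that layout holds for every record in the set (the function name is hypothetical):

```python
def split_oai_identifier(identifier):
    """Split 'oai:<host>:<alias>/<pointer>' into (host, alias, pointer).

    Assumption: the identifier has exactly this layout, as in the
    sample record; anything else raises ValueError.
    """
    scheme, host, local = identifier.split(":", 2)
    if scheme != "oai":
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    alias, _, pointer = local.partition("/")
    if not alias or not pointer.isdigit():
        raise ValueError(f"unexpected local part: {local!r}")
    return host, alias, int(pointer)
```

For the sample record, this yields the host cdm1224.contentdm.oclc.org, the collection alias p1224coll8, and the pointer 12.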

When I visit the third (and redacted) identifier, I am presented with a viewer page that offers a search box. When I search, my query terms are highlighted on the scanned image, so I know the item has been OCRed.

Is it possible to reverse-engineer any one of the identifiers, above, to point to the OCR'ed text, and if so, then how?

In the end, I want to download the OCRed text of a given set of digitized content. I will also download the texts' bibliographics. Finally, I will use text mining and natural language processing to evaluate the content, look for patterns, and address a faculty member's research questions.

Using OAI-PMH I can get the bibliographics, but how can I get the OCRed text?

--
Eric Morgan
Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
