contentdm and ocred text

From: Eric Lease Morgan <00000107b9c961ae-dmarc-request_at_nyob>
Date: Thu, 17 Oct 2024 13:03:04 -0400
To: CODE4LIB_at_LISTS.CLIR.ORG
Given a CONTENTdm item that has been OCRed, is it possible to download the OCRed text, and if so, then what shape does the URL take?

Using OAI-PMH I can list all the records in a CONTENTdm set. Here is an abbreviated, redacted example of a specific record:

<record>
  <header>
    <identifier>oai:cdm1224.contentdm.oclc.org:p1224coll8/12</identifier>
    <datestamp>2015-08-06</datestamp>
    <setSpec>p1224coll8</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/..." >
    <dc:title>'A' Company underground</dc:title>
    <dc:publisher>Co. 'A' Underground</dc:publisher>
    <dc:date>1972</dc:date>
    <dc:language>English</dc:language>
    <dc:coverage>United States</dc:coverage>
    <dc:format>XML</dc:format>
    <dc:rights>Copyright in most of the documents...</dc:rights>
    <dc:source>foo ba</dc:source>
    <dc:type>Text; Image</dc:type>
    <dc:identifier>foobarNewsletter001000</dc:identifier>
    <dc:identifier>http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>
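
For illustration, here is a minimal Python sketch of how such records might be requested and their identifiers pulled out. The base URL path (/oai/oai.php) is an assumption about where this CONTENTdm site exposes OAI-PMH and should be checked against the actual server. The function names are hypothetical. The verb, metadataPrefix, and set parameters and the two namespace URIs come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: the OAI-PMH endpoint path; verify it against the real site.
BASE_URL = "http://cdm1224.contentdm.oclc.org/oai/oai.php"

# Standard namespace URIs for OAI-PMH 2.0 and Dublin Core elements.
OAI = "http://www.openarchives.org/OAI/2.0/"
DC = "http://purl.org/dc/elements/1.1/"
NS = {"oai": OAI, "dc": DC}


def list_records_url(set_spec, base_url=BASE_URL):
    """Build a ListRecords request URL for one set, asking for oai_dc."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": set_spec}
    return base_url + "?" + urlencode(params)


def identifiers(xml_text):
    """Return one (header identifier, [dc:identifier, ...]) pair per record."""
    root = ET.fromstring(xml_text)
    found = []
    for record in root.iter("{%s}record" % OAI):
        header_id = record.findtext("oai:header/oai:identifier", namespaces=NS)
        dc_ids = [e.text for e in record.iter("{%s}identifier" % DC)]
        found.append((header_id, dc_ids))
    return found
```

The response text would be fetched from list_records_url("p1224coll8") and passed to identifiers(). A complete harvest would also need to follow any resumptionToken the server returns, which this sketch omits.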

There are three identifiers in the record:

  oai:cdm1224.contentdm.oclc.org:p1224coll8/12
  foobarNewsletter001000
  http://cdm1224.contentdm.oclc.org/cdm/ref/collection/p1224coll8/id/12
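
The first and third identifiers appear to carry the same three pieces of information: a host, a collection name, and an item number. As a hedged sketch, the Python below splits them apart, assuming every identifier in the set follows the two shapes shown above. The patterns are inferred from this single record only, and the names "alias" and "pointer" are labels chosen for illustration.

```python
import re
from urllib.parse import urlparse

# Patterns inferred from the one sample record; other records may differ.
OAI_ID = re.compile(r"^oai:(?P<host>[^:]+):(?P<alias>[^/]+)/(?P<pointer>\d+)$")
REF_PATH = re.compile(r"^/cdm/ref/collection/(?P<alias>[^/]+)/id/(?P<pointer>\d+)$")


def parse_oai_identifier(value):
    """Split an OAI identifier into (host, alias, pointer), or return None."""
    match = OAI_ID.match(value)
    if match is None:
        return None
    return match["host"], match["alias"], int(match["pointer"])


def parse_reference_url(value):
    """Split a reference URL into (host, alias, pointer), or return None."""
    parts = urlparse(value)
    match = REF_PATH.match(parts.path)
    if match is None:
        return None
    return parts.netloc, match["alias"], int(match["pointer"])
```

For the record above, both functions yield the host cdm1224.contentdm.oclc.org, the collection p1224coll8, and the item number 12, so either identifier could serve as the starting point for building some other URL.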
  
When I visit the third (and redacted) identifier, I am presented with a viewer page that offers a search box. When I search, my query terms are highlighted on the scanned image. Thus, I know the item has been OCRed.

Is it possible to reverse-engineer any one of the identifiers above to point to the OCRed text, and if so, then how?

In the end, I want to download the OCRed text of a given set of digitized content. I will also download the texts' bibliographics. Finally, I will use text mining and natural language processing to evaluate the content, look for patterns, and address a faculty member's research questions.

Using OAI-PMH I can get the bibliographics, but how can I get the OCRed text?

--
Eric Morgan
Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
Received on Thu Oct 17 2024 - 13:03:15 EDT