Re: hathitrust api

From: Conal Tuohy <conal.tuohy_at_nyob>
Date: Mon, 11 Feb 2019 12:38:51 +1000
To: CODE4LIB_at_LISTS.CLIR.ORG
On Mon, 11 Feb 2019 at 11:51, Eric Lease Morgan <emorgan_at_nd.edu> wrote:

>
> I've finally figured out how to get raw OCR text out of the HathiTrust
> API, but it is really slow. Any hints out there?

...


> Am I missing something when it comes to the API?
>

You may have tried this already, but it seems that Hathi also offer PDF-
and EBM-formatted data at the volume level. Do those formats include the
OCR text? I have seen this done in PDF before (and I've done it myself):
the files contain bitmap page images but the OCR text is also there, in a
layer beneath the images.
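
If you do download one of those PDFs, a quick way to see whether it carries
a text layer at all is to look for font resources, since a file made only of
page images normally has none. What follows is only a rough sketch under that
assumption, not a proper PDF parser: the function name and the file name
"volume.pdf" are made up for illustration, it only understands Flate-compressed
streams, and a PDF that uses fonts solely for a watermark or cover page would
give a false positive.

```python
import re
import zlib


def pdf_seems_to_have_text_layer(pdf_bytes: bytes) -> bool:
    """Crude heuristic: True if the PDF appears to reference any font.

    Assumption: a PDF that holds only scanned page images defines no
    fonts, while one with an OCR text layer must define at least one.
    """
    # Fonts declared in uncompressed objects show up as a literal name.
    if b"/Font" in pdf_bytes:
        return True

    # Otherwise try to inflate each stream and look inside it. Streams
    # that are not Flate-compressed (e.g. JPEG page images) are skipped.
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue
        if b"/Font" in data:
            return True
    return False


if __name__ == "__main__":
    # "volume.pdf" is a placeholder for a PDF you have already downloaded.
    try:
        with open("volume.pdf", "rb") as handle:
            print(pdf_seems_to_have_text_layer(handle.read()))
    except FileNotFoundError:
        print("Put a downloaded PDF at volume.pdf to try this out.")
```

If this says yes, a dedicated tool such as pdftotext is the more reliable way
to actually pull the text out.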

-- 
Conal Tuohy
http://conaltuohy.com/
@conal_tuohy
Received on Sun Feb 10 2019 - 21:42:18 EST