Re: PDF with OCR from different source

From: Rasan Rasch <rasan_at_nyob>
Date: Fri, 8 May 2020 16:08:33 -0400
To: CODE4LIB_at_LISTS.CLIR.ORG

Hi Kim,

One solution would be to use the pdfimages utility from Poppler to
extract all the images from the PDF into a directory.  You would then
place the corresponding hOCR files in the same directory and run the
hocr-pdf utility from hocr-tools to build a searchable PDF from them.

Both software packages are readily available on many Linux systems.

https://poppler.freedesktop.org/
https://github.com/tmbdev/hocr-tools
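
As a rough sketch of the two steps described above, something like the
following Python wrapper might work. It is not a tested recipe. It
assumes that pdfimages and hocr-pdf are on your PATH, that the page
images inside the PDF are stored as JPEG (so that "pdfimages -j" writes
.jpg files), and that each hOCR file shares its base name with its image
(page-000.jpg and page-000.hocr). The function names and the "page"
prefix are made up for illustration.

```python
import subprocess
from pathlib import Path


def extract_command(pdf_path, work_dir, prefix="page"):
    """Build the pdfimages call; -j saves JPEG-encoded images as .jpg."""
    return ["pdfimages", "-j", str(pdf_path), str(Path(work_dir) / prefix)]


def missing_hocr(work_dir):
    """Names of extracted JPEGs with no same-named .hocr file beside them."""
    return sorted(
        p.name
        for p in Path(work_dir).glob("*.jpg")
        if not p.with_suffix(".hocr").exists()
    )


def extract_images(pdf_path, work_dir):
    """Step 1: dump the images from the PDF into work_dir."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_command(pdf_path, work_dir), check=True)


def build_searchable_pdf(work_dir, out_pdf):
    """Step 2: run hocr-pdf on the directory, capturing stdout as the PDF."""
    unmatched = missing_hocr(work_dir)
    if unmatched:
        raise FileNotFoundError(f"no hOCR file for: {unmatched}")
    with open(out_pdf, "wb") as out:
        subprocess.run(["hocr-pdf", str(work_dir)], stdout=out, check=True)


# Intended order of use (hypothetical file names):
#   extract_images("scan.pdf", "pages")
#   ...copy the hOCR files into "pages", renamed to match the images...
#   build_searchable_pdf("pages", "scan-searchable.pdf")
```

The check in missing_hocr is only there to catch naming mismatches early,
since the image and hOCR files are paired by file name. If the PDF stores
its images in some other format, they would need converting to JPEG first.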

Thanks,
Rasan
NYU Digital Library

On Wed, May 6, 2020 at 2:42 PM Kimberly Kennedy <kimberlymkennedy_at_gmail.com>
wrote:

> I have an unusual situation. I've created a PDF that I want to be text
> searchable. However, I would like to use OCR data from a different source
> than that document. Is it possible to add a text file as the OCR layer to
> an existing PDF?
>
> Any ideas would be appreciated!
>
> Thanks,
>
> Kim
>
>
> Kimberly Kennedy
> Digital Production Coordinator
> Northeastern University Library
> ki.kennedy_at_northeastern.edu
>