Extracting Text From .tiff Files

From: Gavin Spomer <spomerg_at_nyob>
Date: Mon, 12 May 2014 15:01:22 -0700
To: CODE4LIB_at_LISTSERV.ND.EDU
Hello folks, 

I'm in the process of migrating a student newspaper collection, currently implemented with ResCarta, into our new bepress institutional repository. ResCarta stores each page of a newspaper as a TIFF file. Besides the image data, each TIFF file contains some metadata in XML format and the full text of the page. I know this because I opened some of the TIFFs in a plain-text editor (Vim).

Although I can see the text in the file, my script has only been about 90% accurate in extracting it. Some of the "weird" characters (presumably the binary image data) seem to do wonky things during file I/O. Is there a more reliable way to extract text stored in a TIFF file? I've Googled and Googled and have found almost nothing. But there has to be a way, since ResCarta stores the text there and can extract it.
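One direction that may be more robust than treating the file as text is to open it in binary mode and walk the TIFF tag directory (the IFD), pulling out only the tags whose type can hold text. The following is a minimal Python sketch of that idea, not ResCarta's actual method. The function names read_text_tags and decode_tag are made up for illustration, only the first IFD of a classic (non-BigTIFF) file is read, and UTF-8 is an assumed encoding. Which tag ResCarta uses for the full text is not known here; the standard tags 270 (ImageDescription) and 700 (XMP) would be plausible places to look first.

```python
import struct

# TIFF field types that can carry text: 1 = BYTE, 2 = ASCII, 7 = UNDEFINED.
TEXT_LIKE_TYPES = {1, 2, 7}


def read_text_tags(data):
    """Return {tag_number: raw_bytes} for text-like tags in the first IFD.

    Hypothetical helper: handles classic TIFF only and ignores later IFDs.
    """
    # Bytes 0-1 give the byte order: b"II" little-endian, b"MM" big-endian.
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        raise ValueError("not a TIFF file")

    # Bytes 2-3 hold the magic number 42, bytes 4-7 the offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled here)")

    # An IFD is a 2-byte entry count followed by 12-byte entries.
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = {}
    for i in range(entry_count):
        entry_start = ifd_offset + 2 + 12 * i
        tag, field_type, count = struct.unpack_from(endian + "HHI", data, entry_start)
        if field_type not in TEXT_LIKE_TYPES:
            continue
        if count <= 4:
            # Values of four bytes or fewer are stored inline in the entry.
            value = data[entry_start + 8 : entry_start + 8 + count]
        else:
            # Otherwise the last four bytes of the entry are a file offset.
            (value_offset,) = struct.unpack_from(endian + "I", data, entry_start + 8)
            value = data[value_offset : value_offset + count]
        tags[tag] = value
    return tags


def decode_tag(value):
    """Decode raw tag bytes as text. UTF-8 is an assumption, not a known fact."""
    return value.rstrip(b"\x00").decode("utf-8", errors="replace")


def dump_text_tags(path):
    """Print every text-like tag in the file at `path` (opened in binary mode)."""
    with open(path, "rb") as handle:
        data = handle.read()
    for tag, value in read_text_tags(data).items():
        print(tag, decode_tag(value)[:200])
```

Calling dump_text_tags on one of the page files should show which tag numbers hold the XML metadata and the page text, after which only those tags need to be extracted. Reading with "rb" keeps the image bytes from being run through any text decoding, which is a likely source of the wonky behavior.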

Any ideas? 
Gavin Spomer
Systems Programmer
Brooks Library
Central Washington University
Received on Mon May 12 2014 - 18:09:48 EDT