hathitrust

From: Eric Lease Morgan <emorgan_at_nyob> Date: Thu, 25 Jan 2018 09:54:59 -0500 To: CODE4LIB_at_LISTS.CLIR.ORG

Working with the HathiTrust Research Center data can be fun, and I sincerely believe it is an under-utilized system, but creating collections sans duplicates is difficult. Has anybody here figured out a “kewl” way to remove duplicates?

Creating HathiTrust collections is easy: do a search, select items of interest, and repeat until tired. One can then download a CSV file describing the collection, but upon closer inspection MANY of the titles are repeated. I know why this has happened, alas, but how might I automatically/programmatically resolve this issue? I’ve begun experimenting with OpenRefine. Does anybody else have other suggestions?
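
To make the question concrete, here is a minimal Python sketch of one programmatic approach: build a crude comparison key from each row and keep only the first row per key. The column names `id`, `title`, and `author`, the helper names `normalize` and `dedupe`, and the sample rows are all assumptions made up for illustration; the headers and delimiter of the real downloaded file may differ and would need adjusting.

```python
import csv
import io
import re
import unicodedata


def normalize(text):
    """Reduce a string to a crude comparison key: no accents, case, or punctuation."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def dedupe(rows, key_fields=("title", "author")):
    """Keep the first row seen for each normalized key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        # key_fields are hypothetical column names; missing values count as empty.
        key = tuple(normalize(row.get(field) or "") for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample standing in for a downloaded collection file.
sample = io.StringIO(
    "id,title,author\n"
    'a1,Walden,"Thoreau, Henry David"\n'
    'b2,"Walden.","Thoreau, Henry David"\n'
    'c3,Moby Dick,"Melville, Herman"\n'
)
rows = list(csv.DictReader(sample))
unique_rows = dedupe(rows)
print(len(rows), "rows in,", len(unique_rows), "rows out")
```

Exact matching on a normalized key only catches near-identical strings. Different editions, or multiple volumes of one work, may need a looser key or a manual review pass, which is roughly the kind of job OpenRefine's clustering is meant for.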

—
Eric Morgan