Re: [CODE4LIB] internet archive api

From: Chad Fennell <fenne035_at_nyob> Date: Mon, 18 Sep 2017 16:07:41 -0500 To: CODE4LIB_at_LISTS.CLIR.ORG

On Sep 18, 2017 3:30 PM, "raffaele messuti" <raffaele_at_docuver.se> wrote:

On 18/09/17 21:37, Eric Lease Morgan wrote:
> A cool collection of early English print materials is available at the
> following URL:
>   https://archive.org/details/bplsceep
>
> Again, can I programmatically read the contents of an Internet Archive
> collection?

this tool is what you need:
https://internetarchive.readthedocs.io/en/latest/

to get a list of all items of the collection:
$ ia search -i collection:bplsctpbs > bplsctpbs.txt

the txt file contains one identifier per row

$ wc -l bplsctpbs.txt
     824 bplsctpbs.txt

$ head -n5 bplsctpbs.txt
accountofcountri00dobb_0
accountofenglish01lang
accountofenglish02lang
accountofenglish03lang
admirableeuentss00camu

then you can fetch the metadata of all items
(using parallel https://www.gnu.org/software/parallel/ )

$ parallel ia metadata {} :::: bplsctpbs.txt > all.json
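
if parallel is not installed, a plain shell loop should give the same result, one item at a time and therefore slower. this is only a minimal sketch: fetch_all is a made-up helper name, and it assumes the ia command is on your PATH and that the list file has one identifier per row, as above.

```shell
# fetch_all FILE
# Hypothetical helper (not part of the ia tool): runs "ia metadata" once
# for every non-empty line of FILE and writes the results to stdout.
fetch_all() {
  while IFS= read -r id; do
    # Skip blank lines so ia is never called without an identifier.
    if [ -n "$id" ]; then
      # Redirect stdin so the command cannot swallow the rest of the list.
      ia metadata "$id" < /dev/null
    fi
  done < "$1"
}

# Usage (assumes the list file from the search step):
#   fetch_all bplsctpbs.txt > all.json
```

the output is collected the same way as with parallel, by redirecting stdout to a file.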

--
raffaele_at_docuver.se