HathiTrust Research Center Extracted Features 2.0

From: Downie, J Stephen <jdownie_at_nyob>
Date: Thu, 4 Jun 2020 20:56:07 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG

Hi colleagues:

Because many of us teach or lead text analytics and data mining classes and projects, some of you might find this open dataset helpful.

Please share widely. The dataset was created to be used by all and sundry in and out of the classroom.

Discoveries await!

Cheers,
Stephen
************************************
HTRC is excited to announce the release of the Extracted Features 2.0 dataset! This new version of Extracted Features offers volume- and page-level data for 17+ million volumes in the HathiTrust Digital Library. The data include:

  *   Bibliographic metadata
  *   Computationally inferred metadata about each page, such as language and line counts
  *   Tokens (words), parts of speech, and their per-page counts

Overall, the dataset represents more than 6 billion pages of text from the digital library and includes nearly 3 trillion tokens from the corpus.
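
For anyone planning a classroom exercise, here is a minimal Python sketch of how per-page token and part-of-speech counts might be rolled up into volume-level counts. The record below is hand-built for illustration: the field names (`pages`, `seq`, `lineCount`, `tokenPosCount`) and the sample values are assumptions, not a statement of the released schema, so please check the schema documentation before adapting it.

```python
from collections import Counter

# Hypothetical, hand-built record for illustration only. The field names
# are assumptions and the values are invented, not taken from the dataset.
volume = {
    "id": "example.0001",
    "pages": [
        {
            "seq": "00000001",
            "language": "en",
            "lineCount": 2,
            # token -> {part-of-speech tag -> count on this page}
            "tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}},
        },
        {
            "seq": "00000002",
            "language": "en",
            "lineCount": 1,
            "tokenPosCount": {"whale": {"NN": 1, "NNP": 1}, "sea": {"NN": 1}},
        },
    ],
}


def page_token_total(page):
    """Sum every token count on one page, across all part-of-speech tags."""
    return sum(
        count
        for pos_counts in page["tokenPosCount"].values()
        for count in pos_counts.values()
    )


def volume_token_counts(vol):
    """Collapse per-page, per-tag counts into one Counter for the volume."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            totals[token] += sum(pos_counts.values())
    return totals


if __name__ == "__main__":
    for page in volume["pages"]:
        print(page["seq"], page_token_total(page))
    print(volume_token_counts(volume).most_common(3))
```

Because the counts are kept per page, the same loop can be restricted to a range of pages (for example, to skip front matter) before aggregating.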

Not only does this release extend the number of volumes in HathiTrust available as Extracted Features, it also incorporates linked data such that names in the files are linked to external authorities when possible.

Learn more about the release and data schema: https://wiki.htrc.illinois.edu/x/kYC2B
Download Extracted Features 2.0 files: https://wiki.htrc.illinois.edu/x/_QGGAQ

Contact htrc-help_at_hathitrust.org with any questions.
Received on Thu Jun 04 2020 - 16:57:51 EDT