Re: Automatically generating keywords from abstracts

From: Eric Lease Morgan <emorgan_at_nyob>
Date: Thu, 22 Oct 2020 14:44:19 -0400
To: CODE4LIB_at_LISTS.CLIR.ORG


On Oct 22, 2020, at 2:25 PM, Edward M. Corrado <ecorrado_at_ECORRADO.US> wrote:

> I have a set of just over 60,000 theses and dissertations abstracts that I
> want to automatically create keywords/topics from. Does anyone have any
> recommendations for text mining or other tools to start with?


I do this sort of thing on a regular basis, and I use two Python libraries/modules:

  1. textacy.ke.scake
  2. textacy.ke.yake

Textacy is built on top of another library called "spaCy". 

To use the libraries, one does the following (see the sketch below):

  1. gets a string
  2. creates a spaCy doc object from the string
  3. applies the scake or yake methods to the object
  4. gets back a keyword (or phrase) plus a score

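Here is a minimal sketch of those four steps, assuming textacy 0.10.x (where the ke module lives); the sample sentence and the topn value are illustrative, not taken from the script below:

  # a minimal sketch; the sample text and topn value are illustrative
  import spacy
  from textacy.ke.yake import yake
  model = spacy.load( 'en_core_web_sm' )
  doc   = model( 'Librarians mine texts in order to surface topics and themes.' )
  for keyword, score in yake( doc, topn=5 ) : print( keyword, score )
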
Attached is a script which takes a file as input and outputs a tab-delimited stream of keywords/phrases.
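
For example, assuming the script is saved as txt2keywords.py and made executable, it might be invoked like this (the file name is illustrative):

  ./txt2keywords.py abstract.txt > keywords.tsv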

--
Eric Morgan


#!/usr/bin/env python

# txt2keywords.py - given a file, output a tab-delimited list of keywords


# configure
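# TOPN: given a float in (0.0, 1.0], textacy returns that proportion of the candidate terms
# MODEL: spaCy's small English model; install it with "python -m spacy download en_core_web_sm"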
TOPN  = 0.005
MODEL = 'en_core_web_sm'

# require
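# note: these textacy module paths are as of version 0.10.x (2020); later releases reorganized them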
import textacy.preprocessing
from textacy.ke.scake import scake
from textacy.ke.yake import yake
import spacy
import os
import sys

# sanity check
if len( sys.argv ) != 2 :
	sys.stderr.write( 'Usage: ' + sys.argv[ 0 ] + " <file>\n" )
	sys.exit( 1 )

# initialize
file = sys.argv[ 1 ]

# open the given file and unwrap it
with open(file) as f: text = f.read()
text = textacy.preprocessing.normalize.normalize_quotation_marks( text )
text = textacy.preprocessing.normalize.normalize_hyphenated_words( text )
text = textacy.preprocessing.normalize.normalize_whitespace( text )

# compute the identifier
id = os.path.basename( os.path.splitext( file )[ 0 ] )

# initialize model
maximum = len( text ) + 1
model   = spacy.load( MODEL )
model.max_length = maximum
doc     = model( text )

# output a header
print( "id\tkeyword" )

# track found keywords to avoid duplicates
keywords = set()

# process and output each keyword with yake, which will produce unigrams
for keyword, score in ( yake( doc,  topn=TOPN ) ) :
	if keyword not in keywords:
		print( "\t".join( [ id, keyword ] ) )
		keywords.add(keyword)

# process and output each keyword with scake, which will typically produce keyphrases;
# removing lemmatization with normalize=None seems to produce better results
for keyword, score in ( scake( doc, normalize=None, topn=TOPN ) ) :
	if keyword not in keywords:
		print( "\t".join( [ id, keyword ] ) )
		keywords.add(keyword)

# done
sys.exit()
Received on Thu Oct 22 2020 - 14:47:28 EDT