lexicon helpers

From: Eric Lease Morgan <00000107b9c961ae-dmarc-request_at_nyob>
Date: Thu, 29 May 2025 14:07:34 -0400
To: CODE4LIB_at_LISTS.CLIR.ORG
Lately I have been doing a whole lot of exploration in the area of lexicons, and I'm sharing a bit of what I've developed -- Lexicon Helpers -- in the hopes of generating discussion of lexicons in library work. The results of these explorations are temporarily available at the following URL:

  https://distantreader.org/tmp/lexicon-helpers.zip

From the readme file:

  This folder/directory contains a set of Bash shell and Python
  scripts used to create and enhance lexicons -- lists of
  desirable words rooted in Distant Reader study carrels. As such,
  these lexicons are kinda, sorta the complements to or the
  inverses of stop words. They are intended to model desirable
  ideas and concepts, and they are intended to be used to query,
  filter, and describe sentences and documents in narrative
  corpora. The scripts exploit the Distant Reader Toolbox [1], but
  the concepts behind the Helpers are apropos to any corpus and
  the investigation of it.
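
  To make the idea concrete, here is a minimal sketch of using a
  lexicon as a filter; the file names (lexicon.txt and corpus.txt)
  are hypothetical, and the sentence splitting is naive:

    import re

    # read a lexicon -- a list of desirable words, one per line
    with open('lexicon.txt') as handle:
        lexicon = set(handle.read().lower().split())

    # naively split a carrel's plain text into sentences
    with open('corpus.txt') as handle:
        sentences = re.split(r'(?<=[.!?])\s+', handle.read())

    # output only the sentences mentioning a lexicon word
    for sentence in sentences:
        if lexicon & set(re.findall(r'\w+', sentence.lower())):
            print(sentence)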

  Many of the scripts in the bin directory are used to create
  lexicons. They are listed below in an order of least complexity
  to greatest complexity, and a sketch of the underlying idea
  follows the list. Remember, in order for these scripts to work,
  one needs to install the Distant Reader Toolbox and have at
  least one study carrel in their local library:

    * initialize-with-unigrams.sh - given the name of a study carrel
      and an integer (N), output a list of the N most frequent words,
      sans stop words
    
    * initialize-with-keywords.sh - given the name of a study carrel
      and an integer (N), output a list of the N most frequent
      keywords, sans stop words
    
    * initialize-with-nouns.sh - given the name of a study carrel and
      an integer (N), output a list of the N most frequent nouns
    
    * initialize-with-file.sh - given the name of a study carrel and
      the name of a file, copy the given file to the carrel's lexicon;
      useful when one has specific ideas to explore and those ideas are
      not explicitly highlighted as unigrams, nouns, or keywords;
      example files can be found in the etc directory
  
    * initialize-with-all.sh - given the name of a study carrel and
      an integer (N), output a list of the N most frequent unigrams,
      nouns, and keywords, plus the keywords' semantically related words
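
  By way of illustration, and stripped of the Toolbox's plumbing,
  the simplest of these scripts boils down to something like the
  following sketch; the file name (corpus.txt) is hypothetical,
  and NLTK's stop word list stands in for the Toolbox's:

    import sys
    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    # one-time download of NLTK's stop word list
    nltk.download('stopwords', quiet=True)
    stops = set(stopwords.words('english'))

    # read a carrel's plain text and normalize it
    with open('corpus.txt') as handle:
        words = [w.lower() for w in handle.read().split() if w.isalpha()]

    # output the N most frequent words, sans stop words
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 100
    frequencies = Counter(w for w in words if w not in stops)
    for word, _ in frequencies.most_common(n):
        print(word)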
    
  Some of the scripts in the bin directory are used to modify and
  enhance existing lexicons; a sketch of two of these ideas
  follows the list:

    * lexicon2variants.py - given the name of a study carrel, output
      all variations of the given lexicon words found in the carrel;
      for example, if a lexicon word is "library", then the output
      ought to include "libraries", "librarians", "librarianship", etc.
    
    * lexicon2variants.sh - a front-end to lexicon2variants.py
    
    * lexicon2related.py - given the name of a study carrel, output
      lexicon words and their semantically similar words; rooted in the
      concept of word embedding, this script identifies words often
      used in the "same breath" as the given word; for example, if the
      given word is "love", then a semantically related word might be
      "relationship"
    
    * lexicon2related.sh - a front-end to lexicon2related.py
    
    * lexicon2synonyms.py - given the name of a study carrel, use
      WordNet to identify and output synonyms of lexicon words; this
      script will most likely output words not necessarily found in a
      study carrel
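
  The gists of lexicon2related.py and lexicon2synonyms.py can be
  sketched with gensim and NLTK's WordNet interface; this is not
  the Helpers' actual code, and the toy training data stands in
  for a real carrel:

    import nltk
    from gensim.models import Word2Vec
    from nltk.corpus import wordnet as wn

    nltk.download('wordnet', quiet=True)

    # train a tiny word-embedding model; real input would be a carrel
    sentences = [['love', 'is', 'a', 'relationship'],
                 ['love', 'conquers', 'all']]
    model = Word2Vec(sentences=sentences, vector_size=50, min_count=1)

    # semantically related words are neighbors in the embedding space
    for word, similarity in model.wv.most_similar('love', topn=3):
        print('related:', word, round(similarity, 2))

    # synonyms come from WordNet and may not appear in the carrel
    synonyms = {lemma.name() for synset in wn.synsets('love')
                for lemma in synset.lemmas()}
    print('synonyms:', sorted(synonyms))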

  The balance of the scripts in the bin directory output network
  graph files used to evaluate and visualize characteristics of a
  lexicon. These scripts are both cool and kewl; they can be used
  to illustrate features, shapes, and relationships between items
  in a lexicon. After you run these scripts and import the
  resulting files into something like Gephi [2], you will be able
  to describe your lexicon in nuanced ways. For example, you will
  be able to identify both strengths and weaknesses of a lexicon.
  A small sketch of the hypernym idea follows the list:

    * lexicon2vectors.py - given the name of a study carrel and an
      integer (N), output an edges file with three columns: 1) source,
      2) target, and 3) weight; the value of source is a lexicon word,
      target is a semantically related word, and weight is the semantic
      distance between source and target
    
    * lexicon2hypernyms2gml.py - given a study carrel, output a Graph
      Modelling Language (GML) file highlighting nouns in the lexicon and their
      hypernyms ("broader terms"); uses WordNet to do the good work,
      and for example, using this script brings to light concepts such
      as "mythical characters" when lexicon words include "ulysses",
      "achilles", and "hector"

  [1] Distant Reader Toolbox - https://reader-toolbox.readthedocs.io
  [2] Gephi - https://gephi.org


Why should you care? Because the process of coming up with sets of words connoting and alluding to concepts is difficult. If I were to ask you to list twelve colors, I assert the process would be a bit challenging. On the other hand, if I were to give you a list of words and then ask you to identify the colors, then the process would be easy. The Helpers address this problem. One is expected to automatically generate a lexicon and then curate it. Moreover, once a lexicon is manifested as a file, it is almost trivial to use the lexicon as input to queries, thus eliminating a whole lot of typing. Again, ease of use.
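
For example, turning a lexicon into a Boolean query is a matter of a few lines; the file name (lexicon.txt) and the OR syntax are hypothetical, depending on your search engine:

  # join the lexicon's words into one long Boolean query
  with open('lexicon.txt') as handle:
      query = ' OR '.join(handle.read().split())
  print(query)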

These scripts work very well for me. They make it easy for me to compare and contrast study carrel -- data set -- contents. They make it easy for me to extract sentences and documents represented by my sets of curated words. They make my work more scalable.

Fun with data science, data science with words.

--
Eric Lease Morgan <emorgan_at_nd.edu>
University of Notre Dame