Re: to stop word, or not to stop word. that is the question

From: Marijane White <whimar_at_nyob>
Date: Fri, 10 Jul 2020 21:04:34 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG
Might also be worth looking into the work on Statistically Improbable Phrases, which tends to use tf-idf.
https://en.wikipedia.org/wiki/Statistically_improbable_phrase (potentially promising reading in the references for this article)

-marijane

From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> on behalf of "Stuart A. Yeates" <syeates_at_GMAIL.COM>
Reply-To: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG>
Date: Friday, July 10, 2020 at 1:40 PM
To: "CODE4LIB_at_LISTS.CLIR.ORG" <CODE4LIB_at_LISTS.CLIR.ORG>
Subject: Re: [CODE4LIB] to stop word, or not to stop word. that is the question

Sounds like a classical use for the tf–idf measure.

For those with no background in information retrieval, see
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
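
For a feel of how the measure behaves, here is a minimal sketch. It assumes plain whitespace tokenization and the basic tf * log(N / df) variant; the function name tf_idf and the toy documents are made up for illustration and are not taken from the CORD-19 data.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical toy corpus (an assumption for illustration only).
docs = [
    "covid-19 patient cell receptor binding",
    "covid-19 patient ventilator outcome",
    "covid-19 cell culture assay",
]
weights = tf_idf(docs)
```

In this toy corpus "covid-19" occurs in every document, so its idf is log(3/3) = 0 and its weight vanishes, while "receptor" (one document) outweighs "cell" (two documents).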


cheers
stuart

--
...let us be heard from red core to black sky

On Sat, 11 Jul 2020 at 06:58, Eric Lease Morgan <emorgan_at_nd.edu> wrote:

To stop word, or not to stop word? That is the question.

Seriously, I am working with a team of people to index and analyze a set of 65,000 to 100,000 full-text scientific journal articles, and all of the articles are on the topic of COVID-19. [1] We have indexed the data set and we have created subsets of the data, affectionately called "study carrels". Each study carrel is characterized by a short name and a few bibliographic-like features. [2] Within each study carrel are a number of different analyses, such as ngram frequencies, parts-of-speech enumerations, and topic modeling.

Each article in each carrel also has a set of "keywords" extracted from it. These keywords are computed, and for all intents & purposes, the computation is pretty good. For example, see a set of keywords from a particular carrel. [3] Unfortunately, many of the study carrels have very very very similar sets of keywords. Again, if you peruse the set of all the carrels [2] you see the preponderance of keywords such as "cell", "covid-19", "SARS", and "patient". These words happen so frequently that they become (almost) meaningless.

My questions to y'all are, "When and where should I add something like 'cell', or better yet 'covid-19', to my list of stopwords?"
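
One possible way to frame the question is as a document-frequency cut-off: treat any keyword that shows up in more than some share of the carrels as a stopword candidate. The sketch below is only an illustration under that assumption; the function name frequent_terms, the max_doc_ratio parameter, and the sample keyword sets are hypothetical and do not describe how the carrels are actually built.

```python
from collections import Counter


def frequent_terms(keyword_sets, max_doc_ratio=0.5):
    """Return terms found in more than max_doc_ratio of the keyword sets."""
    n = len(keyword_sets)
    # Count, for each term, how many keyword sets contain it.
    df = Counter(term for keywords in keyword_sets for term in set(keywords))
    return {term for term, count in df.items() if count / n > max_doc_ratio}


# Hypothetical keyword sets, one per carrel (illustrative data only).
carrels = [
    {"covid-19", "patient", "cell", "ace2"},
    {"covid-19", "patient", "ventilator"},
    {"covid-19", "cell", "vaccine"},
    {"covid-19", "sars", "bat"},
]
candidates = frequent_terms(carrels, max_doc_ratio=0.6)
```

With a cut-off of 0.6 only "covid-19" is flagged; lowering it to 0.4 also flags "patient" and "cell", so the threshold is a judgment call that depends on the collection.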


[1] data set of articles - https://www.semanticscholar.org/cord19

[2] study carrels - https://cord.distantreader.org/carrels/INDEX.HTM

[3] example keywords - https://cord.distantreader.org/carrels/kaggle-risk-factors/index.htm#keywords


--
Eric Morgan

Received on Fri Jul 10 2020 - 17:06:46 EDT