David Johnsons and computer-generated author sets

From: McGrath, Kelley C. <kmcgrath_at_nyob>
Date: Mon, 10 Sep 2007 08:14:27 -0400
To: NGC4LIB_at_listserv.nd.edu

I found the discussion of the list of David Johnsons and the ability of
a computer to group a set of digital texts by author interesting.

The hypothesis that our writing styles have some sort of patterns that
are susceptible to statistical analysis, given a sufficiently long
example, seems reasonable to me, as does the idea that a computer could
potentially be programmed to arrange a set of digital texts into
groupings by author. Potentially these groupings would be more accurate
than human-generated groupings, which can be susceptible to putting
together two people with the same or similar names who write on similar
topics or splitting apart one person who writes on dissimilar topics. On
the other hand, how likely is it that a given author would or could
change their writing style enough to confuse the computer? Would the
computer recognize poetry and prose by the same person?
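
To make the idea concrete, here is a minimal sketch of what grouping by
style might look like. It is only an illustration under my own
assumptions, not the method anyone in the thread described: the word
list FUNCTION_WORDS, the 0.9 threshold, and the function names are all
hypothetical, and real stylometric work uses far richer features and
much longer texts.

```python
import math
import re
from collections import Counter

# Assumption: a tiny, hand-picked list of common English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def style_vector(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_by_style(texts, threshold=0.9):
    """Greedy grouping: each text joins the first group whose first
    member has a similar enough style vector, else starts a new group."""
    vectors = [style_vector(t) for t in texts]
    groups = []  # each group is a list of indexes into texts
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups
```

A sketch this crude would very likely split one person's poetry from
their prose, which is exactly the kind of failure I am wondering about.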

Assuming accurate groupings of texts by author, it seems to me that
there may still be some problems in labeling the authors and displaying
a listing of the authors' names associated with those groupings. For one
thing, if the authors' names are not marked in some way in the digital
text, would the computer be able to identify the author names associated
with the texts? Obviously, if the names are marked, this isn't a
problem, and if the layout of the texts is sufficiently regular or
predictable, the computer could perhaps find the names on its own. Or
perhaps it would require
some human intervention. The computer by itself probably couldn't come
up with a list of names qualified by dates as given in the original
example, but it could qualify them in some other way (e.g., David
Johnson, author of "X") and if it had access to the data, it could group
the variant forms of one author's name (and probably more
comprehensively than catalogers are wont to do in authority records).
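
As a rough illustration of that kind of labeling, the sketch below takes
one group of texts and produces a heading qualified by a title rather
than by dates, along with the variant name forms found in the group. The
function label_group and the (name, title) record layout are assumptions
of mine, and it presumes the names and titles have already been
extracted from the texts somehow.

```python
from collections import Counter

def label_group(records):
    """records: list of (name_as_printed, title) pairs for one group.

    Returns a heading built from the most frequent name form and the
    first title, plus the other name forms seen in the group.
    """
    names = Counter(name for name, _ in records)
    preferred = names.most_common(1)[0][0]
    variants = sorted(n for n in names if n != preferred)
    first_title = records[0][1]
    heading = f'{preferred}, author of "{first_title}"'
    return heading, variants
```

For a group containing "David Johnson" twice and "D. Johnson" once, this
would use "David Johnson" in the heading and list "D. Johnson" as a
variant form.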

A few other things struck me. In some cases, the groupings may not
be associated with what we typically think of as the "author." Take
translations for example. Would a computer analysis tend to group by
translator based on the characteristics of their writing styles? Would
it put two different translators' English versions of The Brothers Karamazov
under different authors or put all of Constance Garnett's translations
of various Russian authors under one author? How about ghostwritten
materials or materials written by groups, such as government agency
reports (which even if they were largely written by one identifiable
person, would ideally be grouped by the sponsoring agency)?

Also, in the list of David Johnsons in a library authority file, there
are likely a number of people who are not authors of texts. Some of them
may be editors or compilers who could not be identified by this method.
Some of them would probably be creators of other types of material such
as film directors or music composers. It would be harder to teach a
computer to analyze film directors' styles, but not necessarily
impossible. It seems to me, though, that it would be exceedingly
difficult for a computer to recognize that the director of film "X" is
the same as the author of book "Y" through this type of analysis.

Just a few thoughts...

Kelley McGrath
kmcgrath_at_bsu.edu