Information Retrieval List Digest 216 (June 6)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-216 6.1

June 6, 1994            Volume XI, Number 23            Issue 216

**********************************************************
IV. PROJECT WORK
    A. Abstracts
       1. IR-Related Dissertation Abstracts
**********************************************************

IV. PROJECTS

IV.A.1. Fr: Susanne M. Humphrey
        Re: Selected IR-Related Dissertation Abstracts

The following citations were selected by title and abstract as being
related to Information Retrieval (IR). They result from a computer
search, using BRS Information Technologies, of the Dissertation
Abstracts Online database produced by University Microfilms
International (UMI). Each entry includes the UMI order number; title;
author; degree, year, and institution; number of pages; one or more
Dissertation Abstracts International (DAI) subject descriptors chosen
by the author; and the abstract.

Unless otherwise specified, paper or microform copies of dissertations
may be ordered from University Microfilms International, Dissertation
Copies, Post Office Box 1764, Ann Arbor, MI 48106; telephone for the
U.S. (except Michigan, Hawaii, Alaska): 1-800-521-3042; for Canada:
1-800-268-6090. Price lists and other ordering and shipping
information are in the introduction to the published DAI. An alternate
source for copies is sometimes provided.

Dissertation titles and abstracts contained here are published with
permission of University Microfilms International, publishers of
Dissertation Abstracts International (copyright by University
Microfilms International), and may not be reproduced without their
prior permission.

AN University Microfilms Order Number ADG93-30408.
AU BENSCH, PETER ALLAN.
TI OCCURRENCE-BASED WORD CATEGORIZATION.
IN University of California, San Diego Ph.D. 1993, 185 pages.
SO DAI v54(06), SecB, pp3175.
DE Computer Science. Language, Linguistics. Artificial Intelligence.
AB We have embarked on a research program that we call
OCCURRENCE-BASED processing.
This methodology, quite simply, monitors the contexts in which data
elements appear. As such, it is similar to co-occurrence statistical
studies, but we do not tally the number of times the data element
occurs in the context--we simply record that it has occurred. Thus,
one occurrence is the same as 1,000 occurrences. We have been applying
this methodology to the task of categorizing words from a natural
language. In particular, we have been applying it to corpora
consisting of samples of written English text (edited newspaper
articles and unedited technical articles). Shifting the emphasis from
co-occurrence likelihoods (frequency-based studies) to co-occurrence
possibilities (occurrence-based studies) has allowed us to isolate
interesting "natural" word categories from moderate-sized corpora. The
preliminary investigations mentioned in this dissertation have shown
that occurrence-based processing is a research approach that warrants
further investigation.

AN University Microfilms Order Number ADG93-27155.
AU BLAKE, JONATHAN DRESSER.
TI CORPUS-BASED EXAMPLE PARSING OF NATURAL LANGUAGE USING BEST-ONLY
   AND EXHAUSTIVE ALGORITHMS.
IN Northwestern University Ph.D. 1993, 196 pages.
SO DAI v54(06), SecB, pp3176.
DE Computer Science. Language, Linguistics.
AB This dissertation discusses a program to parse natural language
using data obtained from a collection of pre-parsed training corpora.
The data obtained for this research consists of collections of phrases
for each of the words in the lexicon. For each of the words in a test
sentence, therefore, it is possible to generate a list of possible
phrases (based on how the word was used in the training set). Two
algorithms are described that combine the lists of phrases to
determine possible parses. This combination is similar to a
large-scale constraint satisfaction problem. The first algorithm is
meant to provide baseline information about this procedure.
At each stage of the process of attempting to fill the slots of
candidate phrases, only the most likely sub-phrase is chosen (the
lists of phrases are ordered by frequency). This algorithm is
recursive. The second algorithm is an exhaustive solution to the
problem of merging the possible lists. It is similar to the best-only
method mentioned above, but at each stage all possible sub-phrases are
returned, not just the most frequent. Two different modes of operating
the best-only algorithm are introduced, one of which compares directly
to the exhaustive algorithm. The first algorithm provides interesting
results, although the performance is less than optimal. The second
algorithm is considerably more effective, and the accuracy grows with
the size of the training set. This shows two things. First, the
exhaustive method appears to be effective. Second, the data used for
this project is quite small, and larger data sets will certainly
provide better results.

AN University Microfilms Order Number ADG93-31757.
AU BRILL, ERIC DAVID.
TI A CORPUS-BASED APPROACH TO LANGUAGE LEARNING.
IN University of Pennsylvania Ph.D. 1993, 165 pages.
SO DAI v54(06), SecB, pp3177.
DE Computer Science. Language, Linguistics.
AB One goal of computational linguistics is to discover a method for
assigning a rich structural annotation to sentences that are presented
as simple linear strings of words; meaning can be much more readily
extracted from a structurally annotated sentence than from a sentence
with no structural information. Also, structure allows for a more
in-depth check of the well-formedness of a sentence. There are two
phases to assigning these structural annotations: first, a knowledge
base is created, and second, an algorithm is used to generate a
structural annotation for a sentence based upon the facts provided in
the knowledge base. Until recently, most knowledge bases were created
manually by language experts.
These knowledge bases are expensive to create and have not been used
effectively in structurally parsing sentences from other than highly
restricted domains. The goal of this dissertation is to make
significant progress toward designing automata that are able to learn
some structural aspects of human language with little human guidance.
In particular, we describe a learning algorithm that takes a small
structurally annotated corpus of text and a larger unannotated corpus
as input, and automatically learns how to assign accurate structural
descriptions to sentences not in the training corpus. The main tool we
use to automatically discover structural information about language
from corpora is transformation-based error-driven learning. The
distribution of errors produced by an imperfect annotator is examined
to learn an ordered list of transformations that can be applied to
provide an accurate structural annotation. We demonstrate the
application of this learning algorithm to part of speech tagging and
parsing. Successfully applying this technique to create systems that
learn could lead to robust, trainable and accurate natural language
processing systems.

AN University Microfilms Order Number ADG93-29544.
AU DOERMANN, DAVID SCOTT.
TI DOCUMENT IMAGE UNDERSTANDING: INTEGRATING RECOVERY AND
   INTERPRETATION.
IN University of Maryland College Park Ph.D. 1993, 272 pages.
SO DAI v54(06), SecB, pp3180.
DE Computer Science.
AB Many document image understanding problems require a more
comprehensive examination of document features than is typically
deemed necessary for recognition tasks. We believe that these problems
require a detailed analysis of stroke and sub-stroke features in the
document image with the goal of obtaining information about the
environment or process which created the document and establishing a
context for understanding. We introduce the concept of recovery into
the document domain.
We provide a "stroke platform" representation which establishes a verifiable "link to the pixels" and demonstrate its usefulness for recovery tasks. This representation allows us to overcome many of the problems associated with the rapid, irreversible abstraction associated with traditional document processing methods and provides the basic framework for our analysis of handwritten documents. By obtaining a detailed description of the document and its properties, we are able to establish a context for analysis and validate assumptions about the domain. This dissertation presents our work on several document image understanding problems: (1) demonstrating the successful use of the stroke platform for the problem of interpreting and reconstructing junctions and endpoints, (2) exploring the effects of the handwriting process on the document by the development of a model for instrument grasp and a study of its effects on pressure features, (3) posing and providing an approach to the problem of recovering temporal information from static images of handwriting, (4) addressing various sub-tasks of the problem of processing form documents, and (5) extending the detailed analysis philosophy to demonstrate its feasibility in related document domains. AN University Microfilms Order Number ADG93-30384. AU HUTCHES, DAVID JOHN. TI DATA STRUCTURES AND ALGORITHMS FOR THE EFFICIENT REPRESENTATION AND RETRIEVAL OF INCREMENTAL LEXICAL INFORMATION. IN University of California, San Diego Ph.D. 1993, 120 pages. SO DAI v54(06), SecB, pp3185. DE Computer Science. AB Ludwig Wittgenstein noted that "One cannot guess how a word functions. One has to look at its use and learn from that" (Wittgenstein 1968, 109). J. R. Firth commented on the meaning of words at the "collocational level" and coined the now oft-repeated phrase "You shall know a word by the company it keeps!" (Firth 1957, 11). 
In recent years there has been a significant resurgence of interest in
the use of statistics for the analysis of linguistic data; with the
on-line availability of large collections of text, sophisticated
analyses of these corpora are now possible. This fact, coupled with a
renewed awareness of the importance of the lexicon in language
processing, has led to experiments which attempt to cast a variety of
linguistic phenomena as statistical in nature. Much work has been done
in the statistical analysis of large corpora, but little attention has
been paid to the problem of constructing a lexicon which encodes the
relationships of words to one another in such a way that these
relationships are efficiently stored and retrieved, especially a
lexicon based on untagged corpora. Such a lexicon is necessary not
only from the theoretical perspective of providing a tool for the
statistical analysis of linguistic data, but also as an integral part
of many tasks involving natural language processing, such as
information retrieval. In the work described here, we attempt to
accomplish two interrelated tasks. First, we examine the storage of
lexico-statistical information as a computational problem and
characterize the data with which one must deal in the processing of
large textual corpora; we use this treatment in the construction of a
working lexico-statistical database. Second, we validate the
assumptions made in building this database by using the information
contained therein in the service of a particular linguistic task, a
statistically based examination of lexical classification and
abstraction.

AN University Microfilms Order Number ADG93-29530.
AU KERVEN, DAVID SCOTT.
TI AN ABSTRACT ARCHITECTURE FOR DISTRIBUTED, OBJECT-ORIENTED
   HYPERMEDIA SYSTEMS.
IN University of Southwestern Louisiana Ph.D. 1993, 224 pages.
SO DAI v54(06), SecB, pp3186.
DE Computer Science. Information Science.
AB The origins of hypermedia can be traced back to 1945 with the
conception of the memex system, a mechanized scientific literature
browsing system. However, this system's speed and efficiency were
limited by the mechanized nature of its components. With the advent of
computers, and later, high-power workstations, the concepts behind
memex became realizable. Hypermedia technology has progressed
significantly in recent years and has been applied to a variety of
application domains. This technology is of significant use in the
computer supported collaborative work (CSCW) domain, since hypermedia
environments are capable of supporting established collaborative work
models. However, an examination of a representative set of existing
environments against an established set of dimensions for CSCW
hypermedia shows that existing systems do not realize this capability
to its full potential. The primary goal of the research proposed here
is to produce a unified, abstract model for a distributed,
object-oriented hypermedia environment that will be capable of
supporting such collaborative endeavors. Three subgoals were generated
to achieve this: the development of an abstract document framework
capable of supporting a large scale CSCW hypermedia environment, the
development of an abstract distributed architecture for CSCW
hypermedia, and the development of template support facilities within
the designed framework and architecture. The abstract framework
developed provides the capacity for node and link attributes. Further,
it incorporates node and link security facilities and allows for ease
of interoperability with externally created information objects.
Finally, it establishes a standard interface for front-end user
interfaces, providing for ease of programmability.
The resultant distributed architecture allows for the distribution of
documents across a network in an efficient and user-transparent
manner, establishes object-level concurrent access security, and
provides node and link network security facilities. The templating
mechanisms incorporated provide the facilities for supporting
arbitrary cognitive models of authoring. They reduce authoring
overhead by providing reusable node and document structuring. Finally,
they serve as additional high-level navigational tools. The abstract
architecture developed is capable of supporting all of the established
CSCW hypermedia dimensions. Further, this architecture provides an
abstract design for creating future systems and a model for
comparative evaluation of existing systems.

AN University Microfilms Order Number ADGMM-78200.
AU WU, ERIC QIAN.
TI TRANSFORMATION AND BENCHMARK EVALUATION FOR SQL QUERIES.
IN Simon Fraser University (Canada) M.Sc. 1991, 125 pages.
SO MAI v31(04) pp1845.
DE Computer Science.
IS ISBN: 0-315-78200-5.
AB Database query optimization research has been ongoing for a long
time. Nevertheless, considerable performance deviations persist
between retrieval times for different, but logically equivalent,
expressions of SQL queries. It would appear that in many actual
applications the query optimizer cannot efficiently optimize the query
with respect to retrieval time unless query transformation and the
physical (index) structure of the database are taken into account. In
this thesis an experimental performance study is carried out, with the
help of the Wisconsin Benchmark, to test which kinds of queries are
generally more efficient than other logically equivalent queries
(based on our classification of SQL queries). This research is
intended to provide an aid for use in natural language database
interfaces, where automatic SQL query generation results in more
efficient query transformations to optimize subsequent data retrieval.
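[Ed. note] The phenomenon Wu studies -- logically equivalent SQL
formulations that must return the same rows yet may be executed quite
differently -- can be sketched in a few lines. The schema, data, and
queries below are invented for illustration, and SQLite stands in for
the benchmarked systems:

```python
import sqlite3

# Hypothetical two-table schema; tables, rows, and queries are
# illustrative only, not taken from the thesis or the Wisconsin
# Benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept(deptno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp(empno INTEGER PRIMARY KEY, ename TEXT,
                     deptno INTEGER);
    INSERT INTO dept VALUES (10, 'IR'), (20, 'DB');
    INSERT INTO emp VALUES (1, 'a', 10), (2, 'b', 20), (3, 'c', 10);
""")

# Two logically equivalent formulations of "names of employees in the
# IR department": a nested IN-subquery and an explicit join. An
# optimizer may cost these differently even though they must return
# identical rows.
q_subquery = """SELECT ename FROM emp
                WHERE deptno IN
                  (SELECT deptno FROM dept WHERE dname = 'IR')
                ORDER BY ename"""
q_join = """SELECT e.ename FROM emp e
            JOIN dept d ON e.deptno = d.deptno
            WHERE d.dname = 'IR'
            ORDER BY e.ename"""

rows_subquery = conn.execute(q_subquery).fetchall()
rows_join = conn.execute(q_join).fetchall()
assert rows_subquery == rows_join == [('a',), ('c',)]

# The chosen execution strategies, however, need not be identical;
# the last column of each EXPLAIN QUERY PLAN row describes a step.
plan_subquery = [r[-1] for r in
                 conn.execute("EXPLAIN QUERY PLAN " + q_subquery)]
plan_join = [r[-1] for r in
             conn.execute("EXPLAIN QUERY PLAN " + q_join)]
```

Whether the plans actually diverge depends on the engine and the
available indexes, which is precisely the dependence on physical
structure that the abstract points out.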
AN University Microfilms Order Number ADG93-30209.
AU YOON, JONG PIL.
TI CONSTRAINT MANAGEMENT IN ACTIVE DATABASES.
IN George Mason University Ph.D. 1993, 134 pages.
SO DAI v54(06), SecB, pp3198.
DE Computer Science.
AB This dissertation addresses the problem of maintaining the
consistency of active databases. An active database consists of a set
of facts, a set of integrity constraints, and a set of production
rules. An active database evolves in that the data are modified
through updates, and constraints and rules are refined by augmenting
knowledge discovered from databases. The focus of this research is on
databases whose organization is described by rules and constraints. As
the database is updated, changes in the database may violate integrity
constraints, and rules may be triggered to ensure that the database
enters a "consistent state," that is, a state in which all integrity
constraints are satisfied. Moreover, as a database is updated, rules
and constraints will have to reflect the database changes. Knowledge
can be discovered from a database, in which case the existing rules
and constraints may be refined by incorporating discovered knowledge
that better represents the database organization. Two problems are
addressed: (1) the database update problem and (2) the problem of
knowledge discovery from a database. In order to effect database
updates, one must evaluate all applicable constraints, and if some are
violated, corrective actions must be taken through constraint
propagation to update impacted objects so as to maintain data
consistency. We propose an update calculus, in which updates
incorporate rules and constraints efficiently. A user may issue a
single update that can be reformulated by the update calculus into a
sequence of updates. The update calculus encapsulates pre- and
post-conditions for an update, repair actions for constraint
violations, and the propagation of update effects.
The second problem is that of knowledge discovery due to changes in a
database. As updates occur in the database, the rules and constraints
governing the structure and behavior of the database may need to be
refined. The approach presented uses knowledge discovery techniques to
analyze the result of a user query, discover knowledge, and
incorporate newly discovered knowledge into the existing rules and
constraints. We present an algorithm for discovering knowledge from
query results; a proof-of-concept prototype has been implemented.

AN University Microfilms Order Number ADG93-29299.
AU MORNER, CLAUDIA JANE.
TI A TEST OF LIBRARY RESEARCH SKILLS FOR EDUCATION DOCTORAL STUDENTS.
IN Boston College Ph.D. 1993, 188 pages.
SO DAI v54(06), SecA, pp2070.
DE Education, Higher. Library Science. Education, Tests and
   Measurements. Education, Teacher Training.
AB A test of library research skills, designed for doctoral students
in education, was developed in response to recent literature
suggesting that these students were unprepared to conduct dissertation
literature reviews. This test was constructed because no appropriate
instrument exists to verify the extent of this problem. Test
development began with a pilot interview study of doctoral students
investigating their library knowledge, patterns of use, and attitudes.
A number of steps were taken to develop the content of the test,
including reviewing published and unpublished library education
resources. Multiple choice test items were written for eight content
clusters. The items were pilot tested and an item analysis was
performed on the results. The test content was judged valid by three
highly qualified experts. A cluster random sample of 149 education
doctoral students from three private universities was administered the
test during class time. The overall response rate was 75%. The test
had a reliability of .72.
Scores ranged from 14.6% to 82.9% correct, with the average student
answering only about 50% of the items correctly. The mean score was
21.95; the standard deviation, 5.35; and the standard error of
measurement, 2.8. No test item discriminated negatively; item
difficulty ranged from 8.1% to 91.3%. Mean scores cross-tabulated with
attitude and demographic data showed little variation by subgroups,
such as gender or full-time or part-time status. A test-retest
indicated a general stability of scores over time. For construct
validity, the correlation procedure showed reasonably, but not
exclusively, independent subscales. Factor analysis did not yield
statistical corroboration of the subscales. Criterion-related validity
was successfully established by comparing test results to a 22-item,
in-library performance test. Validity and reliability investigations
showed that the test, as a whole, adequately measured doctoral
students' library performance, and confirmed previous findings that
many education doctoral students are not well equipped for
doctoral-level library research. Copies of pilot and final tests are
in appendices.

AN University Microfilms Order Number ADGMM-78529.
AU GUO, AIQUN.
TI DEVELOPMENT OF A RT-TREE SPATIOTEMPORAL INDEX STRUCTURE FOR A
   LAND-RELATED DATABASE.
IN University of Toronto (Canada) M.A.Sc. 1992, 111 pages.
SO MAI v31(04) pp1865.
DE Engineering, Civil. Computer Science.
IS ISBN: 0-315-78529-2.
AB A spatiotemporal indexing structure for retrieval and update of
spatial objects in a land-related database is developed. The concept
of the RT-tree (Xu et al. 1990) is extended to permit the encoding of
nested spatial objects and to maintain a historical path in the index
structure. The time-stamping method incorporated in the RT-tree
improves access to historical information.
A small database taken from the Automated Canada Lands Information
System's Property Fabric Information System (PFIS) was used to develop
the RT-tree for application in a land information context. The result
of using this RT-tree on the PFIS database suggests that the RT-tree
has potential as an index structure for land-related databases where
updates are common and maintenance of a historical path is important.

AN University Microfilms Order Number ADG93-29855.
AU TORGUSON, JEFFREY SCOTT.
TI ASSESSMENT OF CARTOGRAPHIC ANIMATION'S POTENTIAL IN AN ELECTRONIC
   ATLAS ENVIRONMENT.
IN The University of Georgia Ph.D. 1993, 262 pages.
SO DAI v54(06), SecA, pp2279.
DE Geography. Information Science. Education, Technology. Education,
   Social Sciences.
AB The role of the academic cartographer combines elements of
traditional cartographic product development and modern-day research
in map communication. Advances in microcomputer technology enable the
incorporation of animated maps into an electronic atlas environment.
Animation's potential is tested from both a practical development
perspective and a user-oriented perspective. Sixteen animation
sequences are developed for a hypothetical Atlas of East Asia, using
three common thematic map types (choropleth, isoline, and flow) and
one anticipated thematic map type termed the "animated symbol" map.
These maps are constructed using GET/PUT animation techniques for the
isoline and animated symbol maps, and graphics palette manipulation
for the choropleth and flow maps. One hundred twenty University of
Georgia undergraduates were shown four of the map sequences, which
were displayed randomly, one map from each thematic map category. They
were asked to evaluate the map subjectively by use of a semantic
differential (SD) test, and to take a brief geographic content (GC)
test about the map sequence. In addition, demographic data, user
preference data, and variables for map complexity were collected.
The data were statistically analyzed at the overall level, the
thematic map level, and the individual map level. A lack of
association between subjective evaluation of the map and geographic
information communicated suggests that spatial information is
transferred regardless of like or dislike of the map. Analysis also
suggests the subjects preferred isoline and flow maps and often
performed better using more complex map types. Upperclassmen were more
critical of the maps than subjects with lesser educational attainment,
but did not obtain higher GC scores, and only subjects with
above-average geography experience produced significantly higher GC
scores. Animation in electronic atlases holds potential for users in
business, industry, government and education. Geographic information
is communicated by the maps in spite of varying educational attainment
of the user. A cartographer developing an electronic atlas will be
able to use the animation software developed for this research, the
map prototypes, the general research results, and the user-oriented
scores as a basis for his/her own map creation.

AN University Microfilms Order Number ADG93-30763.
AU TONTA, YASAR AHMET.
TI AN ANALYSIS OF SEARCH FAILURES IN ONLINE LIBRARY CATALOGS.
IN University of California, Berkeley Ph.D. 1992, 318 pages.
SO DAI v54(06), SecA, pp1985.
DE Library Science. Information Science.
AB This study investigates the causes of search failures that occur in
online library catalogs by developing a conceptual model of search
failures, and examines the retrieval performance of an experimental
online catalog by means of transaction logs, questionnaires, and the
critical incident technique. It analyzes the retrieval effectiveness
of 228 queries from 45 users by employing precision and recall
measures, identifying user-designated ineffective searches, and
comparing them quantitatively and qualitatively with precision and
recall ratios for corresponding searches.
The dissertation tests the hypotheses that users' assessments of
retrieval effectiveness differ from retrieval performance as measured
by precision and recall, and that increasing the match between the
users' vocabulary and that of the system by means of clustering and
relevance feedback techniques will improve the performance and help
reduce failures in online catalogs. In the experiment, half the
records retrieved were judged relevant by the users (precision) before
relevance feedback searches. Yet the system retrieved only about 25%
of the relevant documents in the database (recall). As should be
expected, precision ratios decreased (18%) while recall ratios
increased (45%) as users performed relevance feedback searches. A
multiple linear regression model, which was developed to examine the
relationship between retrieval effectiveness and users' judgements of
the search performance, found that users' assessments of the
effectiveness of their searches were the most significant factor in
explaining precision and recall ratios. Yet there was no strong
correlation between precision and recall ratios and user
characteristics (i.e., frequency of online catalog use and knowledge
of online searching) or users' own assessments of search performance
(i.e., search effectiveness, finding what is wanted). Thus, user
characteristics and users' assessments of retrieval effectiveness are
not adequate measures to predict system performance as measured by
precision and recall ratios. The qualitative analysis showed that
search failures due to zero retrievals and vocabulary mismatch
occurred much less frequently in the online catalog studied. It was
concluded that the classification clustering and relevance feedback
techniques that are available in some probabilistic online catalogs
help decrease the number of search failures considerably.
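[Ed. note] The precision and recall measures used throughout Tonta's
study have a standard set-based definition, which the following short
sketch illustrates; the document identifiers and search results below
are invented:

```python
# Set-based precision and recall as used in evaluating catalog
# searches: precision = relevant retrieved / retrieved;
# recall = relevant retrieved / all relevant in the database.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example mirroring the figures in the abstract: the user
# judges half of the retrieved records relevant, but the search finds
# only a quarter of the relevant documents in the database.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"]
p, r = precision_recall(retrieved, relevant)
assert (p, r) == (0.5, 0.25)
```

Relevance feedback typically moves these numbers in opposite
directions, as the abstract reports: broadening the search raises
recall while admitting more non-relevant records, lowering precision.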
**********************************************************
IRLIST Digest is distributed from the University of California,
Division of Library Automation, 300 Lakeside Drive, Oakland, CA
94612-3550.

Send subscription requests to:  LISTSERV@UCOP.EDU
Send submissions to IRLIST to:  IR-L@UCOP.EDU
Or send subscription requests and submissions to:
NANCY.GUSACK@UCOP.EDU

Editorial Staff:
  Clifford Lynch   clifford.lynch@ucop.edu
  Nancy Gusack     nancy.gusack@ucop.edu
  Mary Engle       mary.engle@ucop.edu

The IRLIST Archives are now available via anonymous FTP, as well as
via the LISTSERV. Using anonymous FTP to the host dla.ucop.edu, the
files will be found in the directory pub/irl, stored in subdirectories
by year (e.g., /pub/irl/1993). Using LISTSERV, send the message INDEX
IR-L to LISTSERV@UCOP.EDU. To get a specific issue listed in the
index, send the message GET IR-L LOGYYMM, where YY is the year and MM
is the numeric month in which the issue was mailed, to
LISTSERV@UCOP.EDU. You will receive the issues for the entire month
you have requested. These files are not to be sold or used for
commercial purposes. Contact Nancy Gusack or Mary Engle for more
information on IRLIST.

THE OPINIONS EXPRESSED IN IRLIST DO NOT REPRESENT THOSE OF THE EDITORS
OR THE UNIVERSITY OF CALIFORNIA. AUTHORS ASSUME FULL RESPONSIBILITY
FOR THE CONTENTS OF THEIR SUBMISSIONS TO IRLIST.