Information Retrieval List Digest 216 (June 6)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-216 6.1

June 6, 1994            Volume XI, Number 23            Issue 216

**********************************************************
IV. PROJECT WORK
    A. Abstracts
       1. IR-Related Dissertation Abstracts
**********************************************************

IV. PROJECTS

IV.A.1. Fr: Susanne M. Humphrey
        Re: Selected IR-Related Dissertation Abstracts

The following citations were selected by title and abstract as being
related to Information Retrieval (IR). They result from a computer
search, using BRS Information Technologies, of the Dissertation
Abstracts Online database produced by University Microfilms
International (UMI). Each entry includes the UMI order number; title;
author; degree, year, and institution; number of pages; one or more
Dissertation Abstracts International (DAI) subject descriptors chosen
by the author; and the abstract.

Unless otherwise specified, paper or microform copies of dissertations
may be ordered from University Microfilms International, Dissertation
Copies, Post Office Box 1764, Ann Arbor, MI 48106; telephone for the
U.S. (except Michigan, Hawaii, Alaska): 1-800-521-3042; for Canada:
1-800-268-6090. Price lists and other ordering and shipping
information are in the introduction to the published DAI. An alternate
source for copies is sometimes provided.

Dissertation titles and abstracts contained here are published with
permission of University Microfilms International, publishers of
Dissertation Abstracts International (copyright by University
Microfilms International), and may not be reproduced without their
prior permission.

AN University Microfilms Order Number ADG93-30408.
AU BENSCH, PETER ALLAN.
TI OCCURRENCE-BASED WORD CATEGORIZATION.
IN University of California, San Diego Ph.D. 1993, 185 pages.
SO DAI v54(06), SecB, pp3175.
DE Computer Science. Language, Linguistics. Artificial Intelligence.
AB We have embarked on a research program that we call
OCCURRENCE-BASED processing.
This methodology, quite simply, monitors the contexts in which data
elements appear. As such, it is similar to co-occurrence statistical
studies, but we do not tally the number of times the data element
occurs in the context--we simply record that it has occurred. Thus,
one occurrence is the same as 1,000 occurrences. We have been applying
this methodology to the task of categorizing words from a natural
language. In particular, we have been applying it to corpora
consisting of samples of written English text (edited newspaper
articles and unedited technical articles). Shifting the emphasis from
co-occurrence likelihoods (frequency-based studies) to co-occurrence
possibilities (occurrence-based studies) has allowed us to isolate
interesting "natural" word categories from moderate-sized corpora. The
preliminary investigations mentioned in this dissertation have shown
that occurrence-based processing is a research approach that warrants
further investigation.

AN University Microfilms Order Number ADG93-27155.
AU BLAKE, JONATHAN DRESSER.
TI CORPUS-BASED EXAMPLE PARSING OF NATURAL LANGUAGE USING BEST-ONLY
   AND EXHAUSTIVE ALGORITHMS.
IN Northwestern University Ph.D. 1993, 196 pages.
SO DAI v54(06), SecB, pp3176.
DE Computer Science. Language, Linguistics.
AB This dissertation discusses a program to parse natural language
using data obtained from a collection of pre-parsed training corpora.
The data obtained for this research consists of collections of phrases
for each of the words in the lexicon. For each of the words in a test
sentence, therefore, it is possible to generate a list of possible
phrases (based on how the word was used in the training set). Two
algorithms are described that combine the lists of phrases to
determine possible parses. This combination is similar to a
large-scale constraint satisfaction problem. The first algorithm is
meant to provide baseline information about this procedure.
At each stage of the process of attempting to fill the slots of
candidate phrases, only the most likely sub-phrase is chosen (the
lists of phrases are ordered by frequency). This algorithm is
recursive. The second algorithm is an exhaustive solution to the
problem of merging the possible lists. It is similar to the best-only
method mentioned above, but at each stage all possible sub-phrases are
returned, not just the most frequent. Two different modes of operating
the best-only algorithm are introduced, one of which compares directly
to the exhaustive algorithm. The first algorithm provides interesting
results, although the performance is less than optimal. The second
algorithm is considerably more effective, and the accuracy grows with
the size of the training set. This shows two things. First, the
exhaustive method appears to be effective. Second, the data used for
this project is quite small, and larger data sets will certainly
provide better results.

AN University Microfilms Order Number ADG93-31757.
AU BRILL, ERIC DAVID.
TI A CORPUS-BASED APPROACH TO LANGUAGE LEARNING.
IN University of Pennsylvania Ph.D. 1993, 165 pages.
SO DAI v54(06), SecB, pp3177.
DE Computer Science. Language, Linguistics.
AB One goal of computational linguistics is to discover a method for
assigning a rich structural annotation to sentences that are presented
as simple linear strings of words; meaning can be much more readily
extracted from a structurally annotated sentence than from a sentence
with no structural information. Also, structure allows for a more
in-depth check of the well-formedness of a sentence. There are two
phases to assigning these structural annotations: first, a knowledge
base is created, and second, an algorithm is used to generate a
structural annotation for a sentence based upon the facts provided in
the knowledge base. Until recently, most knowledge bases were created
manually by language experts.
These knowledge bases are expensive to create and have not been used
effectively in structurally parsing sentences from other than highly
restricted domains. The goal of this dissertation is to make
significant progress toward designing automata that are able to learn
some structural aspects of human language with little human guidance.
In particular, we describe a learning algorithm that takes a small
structurally annotated corpus of text and a larger unannotated corpus
as input, and automatically learns how to assign accurate structural
descriptions to sentences not in the training corpus. The main tool we
use to automatically discover structural information about language
from corpora is transformation-based error-driven learning. The
distribution of errors produced by an imperfect annotator is examined
to learn an ordered list of transformations that can be applied to
provide an accurate structural annotation. We demonstrate the
application of this learning algorithm to part of speech tagging and
parsing. Successfully applying this technique to create systems that
learn could lead to robust, trainable and accurate natural language
processing systems.

AN University Microfilms Order Number ADG93-29544.
AU DOERMANN, DAVID SCOTT.
TI DOCUMENT IMAGE UNDERSTANDING: INTEGRATING RECOVERY AND
   INTERPRETATION.
IN University of Maryland College Park Ph.D. 1993, 272 pages.
SO DAI v54(06), SecB, pp3180.
DE Computer Science.
AB Many document image understanding problems require a more
comprehensive examination of document features than is typically
deemed necessary for recognition tasks. We believe that these problems
require a detailed analysis of stroke and sub-stroke features in the
document image with the goal of obtaining information about the
environment or process which created the document and establishing a
context for understanding. We introduce the concept of recovery into
the document domain.
We provide a "stroke platform" representation which establishes a verifiable "link to the pixels" and demonstrate its usefulness for recovery tasks. This representation allows us to overcome many of the problems associated with the rapid, irreversible abstraction associated with traditional document processing methods and provides the basic framework for our analysis of handwritten documents. By obtaining a detailed description of the document and its properties, we are able to establish a context for analysis and validate assumptions about the domain. This dissertation presents our work on several document image understanding problems: (1) demonstrating the successful use of the stroke platform for the problem of interpreting and reconstructing junctions and endpoints, (2) exploring the effects of the handwriting process on the document by the development of a model for instrument grasp and a study of its effects on pressure features, (3) posing and providing an approach to the problem of recovering temporal information from static images of handwriting, (4) addressing various sub-tasks of the problem of processing form documents, and (5) extending the detailed analysis philosophy to demonstrate its feasibility in related document domains. AN University Microfilms Order Number ADG93-30384. AU HUTCHES, DAVID JOHN. TI DATA STRUCTURES AND ALGORITHMS FOR THE EFFICIENT REPRESENTATION AND RETRIEVAL OF INCREMENTAL LEXICAL INFORMATION. IN University of California, San Diego Ph.D. 1993, 120 pages. SO DAI v54(06), SecB, pp3185. DE Computer Science. AB Ludwig Wittgenstein noted that "One cannot guess how a word functions. One has to look at its use and learn from that" (Wittgenstein 1968, 109). J. R. Firth commented on the meaning of words at the "collocational level" and coined the now oft-repeated phrase "You shall know a word by the company it keeps!" (Firth 1957, 11). 
In recent years there has been a significant resurgence of interest in
the use of statistics for the analysis of linguistic data; with the
on-line availability of large collections of text, sophisticated
analyses of these corpora are now possible. This fact, coupled with a
renewed awareness of the importance of the lexicon in language
processing, has led to experiments which attempt to cast a variety of
linguistic phenomena as statistical in nature. Much work has been done
in the statistical analysis of large corpora, but little attention has
been paid to the problem of constructing a lexicon which encodes the
relationships of words to one another in such a way that these
relationships are efficiently stored and retrieved, especially a
lexicon based on untagged corpora. Such a lexicon is necessary not
only from the theoretical perspective of providing a tool for the
statistical analysis of linguistic data, but also as an integral part
of many tasks involving natural language processing, such as
information retrieval. In the work described here, we attempt to
accomplish two interrelated tasks. First, we examine the storage of
lexico-statistical information as a computational problem and
characterize the data with which one must deal in the processing of
large textual corpora; we use this treatment in the construction of a
working lexico-statistical database. Second, we validate the
assumptions made in building this database by using the information
contained therein in the service of a particular linguistic task, a
statistically based examination of lexical classification and
abstraction.

AN University Microfilms Order Number ADG93-29530.
AU KERVEN, DAVID SCOTT.
TI AN ABSTRACT ARCHITECTURE FOR DISTRIBUTED, OBJECT-ORIENTED
   HYPERMEDIA SYSTEMS.
IN University of Southwestern Louisiana Ph.D. 1993, 224 pages.
SO DAI v54(06), SecB, pp3186.
DE Computer Science. Information Science.
AB The origins of hypermedia can be traced back to 1945 with the
conception of the memex system, a mechanized scientific literature
browsing system. However, this system's speed and efficiency were
limited by the mechanized nature of its components. With the advent of
computers, and later, high-power workstations, the concepts behind
memex became realizable. Hypermedia technology has progressed
significantly in recent years and has been applied to a variety of
application domains. This technology is of significant use in the
computer supported collaborative work (CSCW) domain, since hypermedia
environments are capable of supporting established collaborative work
models. However, an examination of a representative set of existing
environments against an established set of dimensions for CSCW
hypermedia shows that existing systems do not realize this capability
to its full potential. The primary goal of the research proposed here
is to produce a unified, abstract model for a distributed,
object-oriented hypermedia environment that will be capable of
supporting such collaborative endeavors. Three subgoals were generated
to achieve this: the development of an abstract document framework
capable of supporting a large scale CSCW hypermedia environment, the
development of an abstract distributed architecture for CSCW
hypermedia, and the development of template support facilities within
the designed framework and architecture. The abstract framework
developed provides the capacity for node and link attributes. Further,
it incorporates node and link security facilities and allows for ease
of interoperability with externally created information objects.
Finally, it establishes a standard interface for front-end user
interfaces, providing for ease of programmability.
The resultant distributed architecture allows for the distribution of
documents across a network in an efficient and user-transparent
manner, establishes object-level concurrent access security, and
provides node and link network security facilities. The templating
mechanisms incorporated provide the facilities for supporting
arbitrary cognitive models of authoring. They reduce authoring
overhead by providing reusable node and document structuring. Finally,
they serve as additional high-level navigational tools. The abstract
architecture developed is capable of supporting all of the established
CSCW hypermedia dimensions. Further, this architecture provides an
abstract design for creating future systems and a model for
comparative evaluation of existing systems.

AN University Microfilms Order Number ADGMM-78200.
AU WU, ERIC QIAN.
TI TRANSFORMATION AND BENCHMARK EVALUATION FOR SQL QUERIES.
IN Simon Fraser University (Canada) M.Sc. 1991, 125 pages.
SO MAI v31(04) pp1845.
DE Computer Science.
IS ISBN: 0-315-78200-5.
AB Database query optimization research has been ongoing for a long
time. Nevertheless, considerable performance deviations persist
between retrieval times for different, but logically equivalent,
expressions of SQL queries. It would appear that in many actual
applications the query optimizer cannot efficiently optimize the query
with respect to retrieval time unless query transformation and the
physical (index) structure of the database are taken into account. In
this thesis an experimental performance study is carried out, with the
help of the Wisconsin Benchmark, to test which kinds of queries are
generally more efficient than other logically equivalent queries
(based on our classification of SQL queries). This research is
intended to provide an aid for use in natural language database
interfaces, where automatic SQL query generation results in more
efficient query transformations to optimize subsequent data retrieval.
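[Ed. note] The phenomenon Wu studies -- logically equivalent SQL
formulations that must return the same rows yet may be executed quite
differently -- can be sketched in a few lines. The schema, data, and
queries below are invented for illustration, and SQLite stands in for
the benchmarked systems:

```python
import sqlite3

# Hypothetical two-table schema; tables, rows, and queries are
# illustrative only, not taken from the thesis or the Wisconsin
# Benchmark.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept(deptno INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE emp(empno INTEGER PRIMARY KEY, ename TEXT,
                     deptno INTEGER);
    INSERT INTO dept VALUES (10, 'IR'), (20, 'DB');
    INSERT INTO emp VALUES (1, 'a', 10), (2, 'b', 20), (3, 'c', 10);
""")

# Two logically equivalent formulations of "names of employees in the
# IR department": a nested IN-subquery and an explicit join. An
# optimizer may cost these differently even though they must return
# identical rows.
q_subquery = """SELECT ename FROM emp
                WHERE deptno IN
                  (SELECT deptno FROM dept WHERE dname = 'IR')
                ORDER BY ename"""
q_join = """SELECT e.ename FROM emp e
            JOIN dept d ON e.deptno = d.deptno
            WHERE d.dname = 'IR'
            ORDER BY e.ename"""

rows_subquery = conn.execute(q_subquery).fetchall()
rows_join = conn.execute(q_join).fetchall()
assert rows_subquery == rows_join == [('a',), ('c',)]

# The chosen execution strategies, however, need not be identical;
# the last column of each EXPLAIN QUERY PLAN row describes a step.
plan_subquery = [r[-1] for r in
                 conn.execute("EXPLAIN QUERY PLAN " + q_subquery)]
plan_join = [r[-1] for r in
             conn.execute("EXPLAIN QUERY PLAN " + q_join)]
```

Whether the plans actually diverge depends on the engine and the
available indexes, which is precisely the dependence on physical
structure that the abstract points out.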
AN University Microfilms Order Number ADG93-30209.
AU YOON, JONG PIL.
TI CONSTRAINT MANAGEMENT IN ACTIVE DATABASES.
IN George Mason University Ph.D. 1993, 134 pages.
SO DAI v54(06), SecB, pp3198.
DE Computer Science.
AB This dissertation addresses the problem of maintaining the
consistency of active databases. An active database consists of a set
of facts, a set of integrity constraints, and a set of production
rules. An active database evolves in that the data are modified
through updates, and constraints and rules are refined by augmenting
knowledge discovered from databases. The focus of this research is on
databases whose organization is described by rules and constraints. As
the database is updated, changes in the database may violate integrity
constraints, and rules may be triggered to ensure that the database
enters a "consistent state," that is, a state in which all integrity
constraints are satisfied. Moreover, as a database is updated, rules
and constraints will have to reflect the database changes. Knowledge
can be discovered from a database, in which case the existing rules
and constraints may be refined by incorporating discovered knowledge
that better represents the database organization. Two problems are
addressed: (1) the database update problem and (2) the problem of
knowledge discovery from a database. In order to effect database
updates, one must evaluate all applicable constraints, and if some are
violated, corrective actions must be taken through constraint
propagation to update impacted objects so as to maintain data
consistency. We propose an update calculus, in which updates
incorporate rules and constraints efficiently. A user may issue a
single update that can be reformulated by the update calculus into a
sequence of updates. The update calculus encapsulates pre- and
post-conditions for an update, repair actions for constraint
violations, and the propagation of update effects.
The second problem is that of knowledge discovery due to changes in a
database. As updates occur in the database, the rules and constraints
governing the structure and behavior of the database may need to be
refined. The approach presented uses knowledge discovery techniques to
analyze the result of a user query, discover knowledge, and
incorporate newly discovered knowledge into the existing rules and
constraints. We present an algorithm for discovering knowledge from
query results; a proof-of-concept prototype has been implemented.

AN University Microfilms Order Number ADG93-29299.
AU MORNER, CLAUDIA JANE.
TI A TEST OF LIBRARY RESEARCH SKILLS FOR EDUCATION DOCTORAL STUDENTS.
IN Boston College Ph.D. 1993, 188 pages.
SO DAI v54(06), SecA, pp2070.
DE Education, Higher. Library Science. Education, Tests and
   Measurements. Education, Teacher Training.
AB A test of library research skills, designed for doctoral students
in education, was developed in response to recent literature
suggesting that these students were unprepared to conduct dissertation
literature reviews. This test was constructed because no appropriate
instrument exists to verify the extent of this problem. Test
development began with a pilot interview study of doctoral students
investigating their library knowledge, patterns of use, and attitudes.
A number of steps were taken to develop the content of the test,
including reviewing published and unpublished library education
resources. Multiple choice test items were written for eight content
clusters. The items were pilot tested and an item analysis was
performed on the results. The test content was judged valid by three
highly qualified experts. A cluster random sample of 149 education
doctoral students from three private universities was administered the
test during class time. The overall response rate was 75%. The test
had a reliability of .72.
Scores ranged from 14.6% to 82.9% correct, with the average student
answering only about 50% of the items correctly. The mean score was
21.95; the standard deviation, 5.35; and the standard error of
measurement, 2.8. No test item discriminated negatively; item
difficulty ranged from 8.1% to 91.3%. Mean scores cross-tabulated with
attitude and demographic data showed little variation by subgroups,
such as gender or full-time or part-time status. A test-retest
indicated a general stability of scores over time. For construct
validity, the correlation procedure showed reasonably, but not
exclusively, independent subscales. Factor analysis did not yield
statistical corroboration of the subscales. Criterion-related validity
was successfully established by comparing test results to a 22-item,
in-library performance test. Validity and reliability investigations
showed that the test, as a whole, adequately measured doctoral
students' library performance, and confirmed previous findings that
many education doctoral students are not well equipped for
doctoral-level library research. Copies of pilot and final tests are
in appendices.

AN University Microfilms Order Number ADGMM-78529.
AU GUO, AIQUN.
TI DEVELOPMENT OF A RT-TREE SPATIOTEMPORAL INDEX STRUCTURE FOR A
   LAND-RELATED DATABASE.
IN University of Toronto (Canada) M.A.Sc. 1992, 111 pages.
SO MAI v31(04) pp1865.
DE Engineering, Civil. Computer Science.
IS ISBN: 0-315-78529-2.
AB A spatiotemporal indexing structure for retrieval and update of
spatial objects in a land-related database is developed. The concept
of the RT-tree (Xu et al. 1990) is extended to permit the encoding of
nested spatial objects and to maintain a historical path in the index
structure. The time-stamping method incorporated in the RT-tree
improves access to historical information.
A small database taken from the Automated Canada Lands Information
System's Property Fabric Information System (PFIS) was used to develop
the RT-tree for application in a land information context. The result
of using this RT-tree on the PFIS database suggests that the RT-tree
has potential as an index structure for land-related databases where
updates are common and maintenance of a historical path is important.

AN University Microfilms Order Number ADG93-29855.
AU TORGUSON, JEFFREY SCOTT.
TI ASSESSMENT OF CARTOGRAPHIC ANIMATION'S POTENTIAL IN AN ELECTRONIC
   ATLAS ENVIRONMENT.
IN The University of Georgia Ph.D. 1993, 262 pages.
SO DAI v54(06), SecA, pp2279.
DE Geography. Information Science. Education, Technology. Education,
   Social Sciences.
AB The role of the academic cartographer combines elements of
traditional cartographic product development and modern-day research
in map communication. Advances in microcomputer technology enable the
incorporation of animated maps into an electronic atlas environment.
Animation's potential is tested from both a practical development
perspective and a user-oriented perspective. Sixteen animation
sequences are developed for a hypothetical Atlas of East Asia, using
three common thematic map types (choropleth, isoline, and flow) and
one anticipated thematic map type termed the "animated symbol" map.
These maps are constructed using GET/PUT animation techniques for the
isoline and animated symbol maps, and graphics palette manipulation
for the choropleth and flow maps. One hundred twenty University of
Georgia undergraduates were shown four of the map sequences, which
were displayed randomly, one map from each thematic map category. They
were asked to evaluate the map subjectively by use of a semantic
differential (SD) test, and to take a brief geographic content (GC)
test about the map sequence. In addition, demographic data, user
preference data, and variables for map complexity were collected.
The data were statistically analyzed at the overall level, the
thematic map level, and the individual map level. A lack of
association between subjective evaluation of the map and geographic
information communicated suggests that spatial information is
transferred regardless of like or dislike of the map. Analysis also
suggests the subjects preferred isoline and flow maps and often
performed better using more complex map types. Upperclassmen were more
critical of the maps than subjects with lesser educational attainment,
but did not obtain higher GC scores, and only subjects with
above-average geography experience produced significantly higher GC
scores. Animation in electronic atlases holds potential for users in
business, industry, government and education. Geographic information
is communicated by the maps in spite of varying educational attainment
of the user. A cartographer developing an electronic atlas will be
able to use the animation software developed for this research, the
map prototypes, the general research results, and the user-oriented
scores as a basis for his/her own map creation.

AN University Microfilms Order Number ADG93-30763.
AU TONTA, YASAR AHMET.
TI AN ANALYSIS OF SEARCH FAILURES IN ONLINE LIBRARY CATALOGS.
IN University of California, Berkeley Ph.D. 1992, 318 pages.
SO DAI v54(06), SecA, pp1985.
DE Library Science. Information Science.
AB This study investigates the causes of search failures that occur in
online library catalogs by developing a conceptual model of search
failures, and examines the retrieval performance of an experimental
online catalog by means of transaction logs, questionnaires, and the
critical incident technique. It analyzes the retrieval effectiveness
of 228 queries from 45 users by employing precision and recall
measures, identifying user-designated ineffective searches, and
comparing them quantitatively and qualitatively with precision and
recall ratios for corresponding searches.
The dissertation tests the hypotheses that users' assessments of
retrieval effectiveness differ from retrieval performance as measured
by precision and recall, and that increasing the match between the
users' vocabulary and that of the system by means of clustering and
relevance feedback techniques will improve the performance and help
reduce failures in online catalogs. In the experiment, half the
records retrieved were judged relevant by the users (precision) before
relevance feedback searches. Yet the system retrieved only about 25%
of the relevant documents in the database (recall). As should be
expected, precision ratios decreased (18%) while recall ratios
increased (45%) as users performed relevance feedback searches. A
multiple linear regression model, which was developed to examine the
relationship between retrieval effectiveness and users' judgements of
the search performance, found that users' assessments of the
effectiveness of their searches were the most significant factor in
explaining precision and recall ratios. Yet there was no strong
correlation between precision and recall ratios and user
characteristics (i.e., frequency of online catalog use and knowledge
of online searching) or users' own assessments of search performance
(i.e., search effectiveness, finding what is wanted). Thus, user
characteristics and users' assessments of retrieval effectiveness are
not adequate measures to predict system performance as measured by
precision and recall ratios. The qualitative analysis showed that
search failures due to zero retrievals and vocabulary mismatch
occurred much less frequently in the online catalog studied. It was
concluded that the classification clustering and relevance feedback
techniques that are available in some probabilistic online catalogs
help decrease the number of search failures considerably.
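[Ed. note] The precision and recall measures used throughout Tonta's
study have a standard set-based definition, which the following short
sketch illustrates; the document identifiers and search results below
are invented:

```python
# Set-based precision and recall as used in evaluating catalog
# searches: precision = relevant retrieved / retrieved;
# recall = relevant retrieved / all relevant in the database.
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example mirroring the figures in the abstract: the user
# judges half of the retrieved records relevant, but the search finds
# only a quarter of the relevant documents in the database.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7", "d8", "d9", "d10"]
p, r = precision_recall(retrieved, relevant)
assert (p, r) == (0.5, 0.25)
```

Relevance feedback typically moves these numbers in opposite
directions, as the abstract reports: broadening the search raises
recall while admitting more non-relevant records, lowering precision.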
**********************************************************
IRLIST Digest is distributed from the University of California,
Division of Library Automation, 300 Lakeside Drive, Oakland, CA
94612-3550.

Send subscription requests to:  LISTSERV@UCOP.EDU
Send submissions to IRLIST to:  IR-L@UCOP.EDU
Or send subscription requests and submissions to:
NANCY.GUSACK@UCOP.EDU

Editorial Staff:
  Clifford Lynch   clifford.lynch@ucop.edu
  Nancy Gusack     nancy.gusack@ucop.edu
  Mary Engle       mary.engle@ucop.edu

The IRLIST Archives are now available via anonymous FTP, as well as
via the LISTSERV. Using anonymous FTP to the host dla.ucop.edu, the
files will be found in the directory pub/irl, stored in subdirectories
by year (e.g., /pub/irl/1993). Using LISTSERV, send the message INDEX
IR-L to LISTSERV@UCOP.EDU. To get a specific issue listed in the
index, send the message GET IR-L LOGYYMM, where YY is the year and MM
is the numeric month in which the issue was mailed, to
LISTSERV@UCOP.EDU. You will receive the issues for the entire month
you have requested. These files are not to be sold or used for
commercial purposes. Contact Nancy Gusack or Mary Engle for more
information on IRLIST.

THE OPINIONS EXPRESSED IN IRLIST DO NOT REPRESENT THOSE OF THE EDITORS
OR THE UNIVERSITY OF CALIFORNIA. AUTHORS ASSUME FULL RESPONSIBILITY
FOR THE CONTENTS OF THEIR SUBMISSIONS TO IRLIST.