Information Retrieval List Digest 139 (November 24, 1992)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-139

IRLIST Digest                ISSN 1064-6965
November 24, 1992            Volume IX, Number 42            Issue 139

**********************************************************

I. NOTICES
   A. Meeting Announcements/Calls for Papers
      1. Text Retrieval Conference

II. QUERIES
   B. Requests for Information
      1. Voice Recognition Theory

IV. PROJECT WORK
   C. Abstracts
      1. IR-Related Dissertation Abstracts

**********************************************************

I. NOTICES

I.A.1.
Fr: Donna Harman X3569
Re: Text Retrieval Conference

TEXT RETRIEVAL CONFERENCE
January 1993 - August 1993

Conducted by:
National Institute of Standards and Technology (NIST)

Sponsored by:
Defense Advanced Research Projects Agency
Software and Intelligent Systems Technology Office (DARPA/SISTO)

A new conference for examination of text retrieval methodologies (TREC) was held in November 1992 at Gaithersburg, Md. The goal of this conference was to encourage research in text retrieval from large document collections by providing a large test collection, uniform scoring procedures and a forum for organizations interested in comparing their results. Both ad-hoc queries against archival data collections and routing (filtering or dissemination) queries against incoming data streams were tested. The conference was a workshop open only to the 24 participating systems and government sponsors; however, the proceedings will be published by NIST in the spring of 1993.

This announcement serves as a call for participation from groups interested in participating in the second year of this workshop. Participants will be expected to work with approximately million documents (2 gigabytes of data), retrieving lists of documents that could be considered relevant to each of 100 topics (50 routing and 50 ad-hoc topics). NIST will distribute the data and will collect and analyze the results.
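
The uniform scoring procedures mentioned above are described later in this announcement as including traditional recall/precision measures. The announcement does not give formulas, so the following is only a minimal sketch, assuming precision and recall are computed over the top-ranked documents up to a fixed cutoff. The function name, the cutoff value and the document identifiers are hypothetical and are not taken from the NIST evaluation software.

```python
def precision_recall_at(ranked_docs, relevant_docs, cutoff):
    """Return (precision, recall) over the top `cutoff` retrieved documents.

    ranked_docs   -- document ids in the order the system returned them
    relevant_docs -- set of ids judged relevant for the topic
    cutoff        -- how many top-ranked documents to score (assumed value)
    """
    top = ranked_docs[:cutoff]
    hits = sum(1 for doc in top if doc in relevant_docs)
    # Precision: fraction of the retrieved documents that are relevant.
    precision = hits / len(top) if top else 0.0
    # Recall: fraction of the relevant documents that were retrieved.
    recall = hits / len(relevant_docs) if relevant_docs else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}
precision, recall = precision_recall_at(ranked, relevant, cutoff=4)
print(precision, recall)
```

In this made-up example two of the top four documents are relevant, and two of the three relevant documents are found. Scoring every system with the same function over the same judgments is what makes results comparable across participants.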
As before, the workshop will be open only to participating systems and government sponsors. There will be some minimal support distributed to selected participants in an effort to maximize the number of participants and to attract the widest possible variety of technical approaches and system architectures. This funding is intended only as a supplement to other support. Non-U.S. as well as U.S. participants are eligible for this funding.

Schedule:

Dec. 5, 1992 -- deadline for applications, including funding requests
Jan. 1, 1993 -- acceptances announced, and training data distributed to new participants (including 2 CD-ROMs containing about 2 gigabytes of data, and 100 training topics and relevance judgments)
April 1, 1993 -- third gigabyte of data distributed via CD-ROM, after routing queries (see below) received at NIST
May 15, 1993 -- 50 test topics distributed
June 1, 1993 -- results from 50 routing queries and 50 test topics due at NIST
July 30, 1993 -- relevance judgments and individual evaluation scores due back to participants
Aug. 30-Sept. 1, 1993 -- TREC conference at NIST in Gaithersburg, Md.

TASK DESCRIPTION:

Participants will receive 2 gigabytes of data to use for training of their systems, including development of appropriate algorithms or knowledge bases. The 100 topics used in the first TREC conference, and the relevance judgments for these topics, will also be sent. The topics are in the form of a highly-formatted user need statement (see attachment 1). Queries can either be constructed automatically from this topic description, or can be manually constructed. Participants are strongly encouraged to submit at least one run where queries are automatically constructed.

Two types of retrieval operations will be tested: a routing or filtering operation against new data, and an ad-hoc query operation against archival data.
Fifty of the topics (numbers 51-100) initially distributed as training topics will be used by each participating group to create formalized routing or filtering queries to be used for retrieval against a third gigabyte of data. Fifty new test topics will be used against the 2 gigabytes of training data as ad-hoc queries. Results from both types of queries (routing and ad-hoc) will be submitted to NIST as the top X documents (X to be determined at a later date) retrieved for each query. Participants creating queries both automatically and manually may submit both sets for evaluation. Scoring techniques including traditional recall/precision measures will be run for all systems and individual results will be returned to each participant.

CONFERENCE FORMAT:

The conference itself will be used as a forum both for presentation of results (including failure analyses and system comparisons), and for more lengthy system presentations describing retrieval techniques used, experiments run using the data, and other issues of interest to researchers in information retrieval. As there is a limited amount of time for these presentations, the program committee will determine which groups are asked to speak and which groups will present in a poster session. Additionally, some organizations may not wish to describe their proprietary algorithms, and these groups may choose to participate in a different manner (see Category C). To allow a maximum number of participants, the following three categories have been established.

CATEGORY A: FULL PARTICIPATION:

Participants will be expected to work with the full data set, and to present full details of system algorithms and various experiments run using the data, either in a talk or in a poster session. In addition to algorithms and experiments, some information on time and effort statistics should be provided.
This includes time for data preparation (such as indexing, building a manual thesaurus, building a knowledge base), time for construction of manual queries, query execution time, etc. More details on the desired content of the presentation will be provided later.

CATEGORY B: EXPLORATORY GROUPS:

Because small groups with novel retrieval techniques might like to participate but may have limited research resources, a category has been set up to work with only a subset of the data. This subset (see data description below) will consist of about 1/2 gigabyte of training data (and all training topics), and 1/4 gigabyte of test data (and all test topics). Participants in this category will be expected to follow the same schedule as Category A, except with less data, and will be expected to present full details of system algorithms, experiments, and time and effort statistics either in a poster session or in a talk.

CATEGORY C: EVALUATION ONLY:

Participants in this category will be expected to work on the full data set, submit results for common scoring and tabulation, and present their results in a poster session, including the time and effort statistics described in Category A. They will not be expected to describe their systems in detail. It is not anticipated that any supplemental funding will be available for this category.

DATA (TEST COLLECTION):

The test collection (documents, topics, and relevance judgments) will be the same collection (English only) being used for the DARPA TIPSTER project. The collection is being assembled from Linguistic Data Consortium text, and an LDC User Agreement will be required from all participants. The documents will be an assorted collection of newspapers (including the Wall Street Journal), newswires, journals, technical abstracts and email newsgroups. The test set will be of approximately the same composition as the training set, and all documents will be typical of those seen in a real-world situation (i.e.
there will not be arcane vocabulary, but there may be missing pieces of text or typographical errors). The format of the documents is relatively clean and easy to use as is (see attachment 2). Most of the documents will consist of a text section only, with no titles or other categories.

The relevance judgments against which each system's output will be scored will be made by experienced relevance assessors based on the output of all TREC participants using a pooled relevance methodology.

RESPONSE FORMAT AND SUBMISSION DETAILS:

By Dec. 5, 1992 organizations wishing to participate should respond to the call for participation by submitting a summary of their text retrieval approach and a system architecture description, not to exceed five pages in total. The summary should include the strengths and significance of their approach to text retrieval, and highlight differences between their approach and other retrieval approaches. These summaries will serve as the basis for published proceedings. Opportunity to revise the summaries and add explanations of the results will be provided before publication.

Each organization should indicate in which category they wish to participate. Please indicate clearly the persons responsible for the summary statement and to whom correspondence should be directed. A full regular address, telephone number, and an email address should be given. EMAIL IS THE PREFERRED METHOD OF COMMUNICATION, although it is realized that diagrams and figures will need to be sent by regular mail or FAX. It is expected that ALL participants have some access to email, as conference communications will be done via email.

Those organizations wishing to apply for funding to supplement their own resources must provide a second statement (not to exceed two pages). This statement should include an estimate of the amount of funding available from other sources to support participation in this work, and a specification of the amount of funding desired.
Please clearly indicate whether the organization is interested in participating in TREC even if no funding is available.

All responses should be submitted by Dec. 5, 1992 to the Program Chair, Donna Harman:

harman@magi.ncsl.nist.gov

Donna Harman, NIST, Building 225/A216, Gaithersburg, Md. 20899
FAX: 301-975-2128

AS NOTED ABOVE, EMAIL IS THE DESIRED FORM OF COMMUNICATION.

Any questions about conference participation, response format, etc. should also be sent to the same address.

**********************************************************

II. QUERIES

II.B.1.
Fr: Jefferey Lundstrom
Re: Voice Recognition Theory

Does anyone out there have any information on Computer Voice recognition Theory? I have a friend that is writing a paper on it and she needs some info! Please let me know if you can help!!!

Thanks in advance!!!

Jeff Lundstrom

**********************************************************

IV. PROJECT WORK

IV.C.1.
Fr: Susanne M. Humphrey
Re: Selected IR-Related Dissertation Abstracts

The following are citations selected by title and abstract as being related to Information Retrieval (IR), resulting from a computer search, using BRS Information Technologies, of the Dissertation Abstracts Online database produced by University Microfilms International (UMI). Included are UMI order number, title, author, degree, year, institution, number of pages, one or more Dissertation Abstracts International (DAI) subject descriptors chosen by the author, and abstract.

Unless otherwise specified, paper or microform copies of dissertations may be ordered from University Microfilms International, Dissertation Copies, Post Office Box 1764, Ann Arbor, MI 48106; telephone for U.S. (except Michigan, Hawaii, Alaska): 1-800-521-3042, for Canada: 1-800-268-6090. Price lists and other ordering and shipping information are in the introduction to the published DAI. An alternate source for copies is sometimes provided.
Dissertation titles and abstracts contained here are published with permission of University Microfilms International, publishers of Dissertation Abstracts International (copyright by University Microfilms International), and may not be reproduced without their prior permission.

AN University Microfilms Order Number ADGDX-95335.
AU BURTON, ALAN.
TI A SUBLANGUAGE OF ENGLISH FOR DATABASE QUERY IN A MANAGERIAL ENVIRONMENT.
IN Council for National Academic Awards (United Kingdom) Ph.D. 1991, 280 pages.
SO DAI V52(11), SecA, pp3996.
DE Business Administration, Management. Artificial Intelligence. Computer Science.
AB Available from UMI in association with The British Library.

This thesis is concerned with the development and testing of a restricted sublanguage of English for Natural Language (NL) database query in a managerial environment. While recognising that for experienced and frequent computer users a concise command language is preferred, it is argued that in a managerial environment NL is the most suitable means of accessing databases. This thesis confirms that NL human-computer communication is not characterised by the forms of complex linguistic behaviour, and sources of ambiguity, which are observed in human-human dialogues. Full NL capabilities are neither necessary nor desirable: what is needed is a naturalistic sublanguage of English, reduced in lexical and syntactic complexity, but nevertheless providing the flexibility of NL input.

Most empirical investigations of NL systems to date have been laboratory based using undergraduate students as experimental subjects. Previous investigations have shown that results obtained in the laboratory do not carry over into real work settings. The investigations described in this thesis involve real users (that is, actual or potential system users) accessing live databases, and take place in a setting that is as near as possible to the real work setting.
Experiment One gives an insight into the characteristics of the language used by managers for database query, providing a comparison with earlier work. The results draw attention to the significance of dialogue failure, and suggest the need for system developers to adopt a strategy of design-for-failure as opposed to the more conventional design-for-success. The results raise a number of important questions about the use of intersentential linking devices, particularly ellipsis, and suggest the need to address extralinguistic interface issues.

Experiment Two focuses on some of the issues raised by Experiment One. In particular, the Experiment tests the hypothesis that an ability to interpret intersentential linking devices, such as anaphora and ellipsis, does not necessarily enhance the usability of an interface.

The ATMI (Access to Management Information) Natural Language Interface has been developed in Prolog running on VAX machines under the VMS operating system. It is a knowledge-based system built around a Definite Clause Grammar, and it is able to handle some complex linguistic phenomena. ATMI accesses Oracle files held at British Gas Engineering Research Station in Newcastle Upon Tyne.

AN University Microfilms Order Number ADG92-11415.
AU BONNER, ANTHONY J.
TI HYPOTHETICAL REASONING IN DEDUCTIVE DATABASES.
IN Rutgers The State University of New Jersey - New Brunswick Ph.D. 1991, 244 pages.
SO DAI V52(11), SecB, pp5922.
DE Computer Science.
AB This dissertation addresses a limitation of most deductive database systems: They cannot reason hypothetically. Although they reason effectively about the world as it is, they are poor at tasks such as planning and design, where one must infer the consequences of hypothetical actions and possibilities. For instance, with a typical database query language, a user can retrieve those students "who are eligible to graduate," but cannot retrieve those students "who would be eligible if they took one more course."
To express such queries, the dissertation develops a logic programming language in which a user can create hypotheses and draw inferences from them. In addition, we show that the language has several important properties. First, it is more expressive than any database query language based on classical logic, since it can express some simple hypothetical queries that classical logic cannot. Second, it can describe large rulebases concisely, because as we show, hypothetical operations allow a user to specify new rulebases by reusing and modifying old ones. Finally, by imposing syntactic restrictions, the language expresses exactly the database queries in many well-known complexity classes, including polynomial space, exponential time, and the polynomial time hierarchy.

AN University Microfilms Order Number ADGNN-61072.
AU MCFADYEN, RONALD GARY.
TI SEQUENTIAL ACCESS IN FILES USED FOR PARTIAL MATCH RETRIEVAL.
IN University of Waterloo (Canada) Ph.D. 1990, 170 pages.
SO DAI V52(11), SecB, pp5932.
DE Computer Science.
IS ISBN: 0-315-61072-7.
AB The central theme of the work in this thesis is the use of sequential access for processing a query. A query specifies a number of records to be retrieved, and hence, some number of pages (a response set) must be retrieved to satisfy the query. We consider the response set to consist of a number of clusters, where a cluster is a set of contiguous pages. If a cluster consists of b pages, and if a buffer pool of b pages is available, then the cluster can be retrieved in one disk access.

We examine files where records are randomly allocated to pages and develop expressions which measure the clustering present. We then consider static hash files (Gray code hash files and standard binary hash files) and make a comparison to random allocations. Next we consider partial match queries and consider the performance of Gray code and binary hash files for two query distributions.
We are concerned with the choice of shuffle function for a multiattribute hash file and its effect on the cost of processing queries.

Lastly, we consider the application of Gray code ordering to dynamic hash files. We develop a z-order multiattribute linear hash file employing partial expansions which maintains Gray code ordering. We examine performance for standard file operations and partial match query processing through simulation experiments.

AN University Microfilms Order Number ADG92-09894.
AU SMADJA, FRANK ALBERT.
TI EXTRACTING COLLOCATIONS FROM TEXT. AN APPLICATION: LANGUAGE GENERATION.
IN Columbia University Ph.D. 1991, 447 pages.
SO DAI V52(11), SecB, pp5937.
DE Computer Science. Language, Linguistics. Statistics.
AB Natural languages are full of collocations, arbitrary and recurrent combinations of words that co-occur more often than chance. Such combinations correspond to arbitrary word usages and are termed collocations. Recent work in lexicography indicates that collocations are pervasive in English; apparently, they are common in all types of writing, including both technical and non-technical genres.

In the dissertation, we describe a set of techniques for retrieving and identifying collocations from large textual corpora. These techniques are based on statistical methods, and identify a wide range of collocations. A statistical filtering technique is described for identifying word pairs involved in a syntactic relation. The words can appear in any order and can be separated by an arbitrary number of other words. Another technique describes how n-word collocations (or n-grams) can be identified in a simpler and cheaper way than other methods. The techniques also feature an original method for syntactically labeling and filtering collocations.

These techniques have been implemented in a lexicographic tool, Xtract, that automatically acquires collocations. Xtract identifies collocations of arbitrary length as well as more flexible collocations.
The techniques are described and some results are presented on a 10 million word corpus of stock market news reports. A lexicographic evaluation of Xtract as a collocation retrieval tool estimated the precision of Xtract to be 80%. The evaluation is presented in the dissertation.

As a performance task, we demonstrate how such collocations enhance the task of lexical selection in language generation. Previous language generation works were not able to account for co-occurrence knowledge for two principal reasons: they did not have the compiled information, and the lexicon formalisms available were not able to properly handle collocational knowledge. The knowledge problem is handled with the use of lexicographic tools such as Xtract, and the representation problem is handled with Functional Unification Grammars (FUGs). We show how the use of FUGs allows us to properly handle the interactions of collocational and various other constraints.

Finally, we consider several other applications of our work such as computer assisted lexicography, information retrieval, machine translation and spelling correction.

**********************************************************

IRLIST Digest is distributed from the University of California, Division of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550.

Send subscription requests to: LISTSERV@UCCVMA.BITNET
Send submissions to IRLIST to: IR-L@UCCVMA.BITNET

Editorial Staff:
Clifford Lynch   calur@uccmvsa.ucop.edu or calur@uccmvsa.bitnet
Nancy Gusack     ncgur@uccmvsa.bitnet
Mary Engle       meeur@uccmvsa.bitnet

The IRLIST Archives will be set up for anonymous FTP, and the address will be announced in future issues.

To access back issues presently, send the message INDEX IR-L to LISTSERV@UCCVMA.BITNET. To get a specific issue listed in the Index, send the message GET IR-L LOGYYMM, where YY is the year and MM is the numeric month in which the issue was mailed, to LISTSERV@UCCVMA (Bitnet) or LISTSERV@UCCVMA.UCOP.EDU.
You will receive the issues for the entire month you have requested.

These files are not to be sold or used for commercial purposes.

Contact Nancy Gusack or Mary Engle for more information on IRLIST.

THE OPINIONS EXPRESSED IN IRLIST DO NOT REPRESENT THOSE OF THE EDITORS OR THE UNIVERSITY OF CALIFORNIA. AUTHORS ASSUME FULL RESPONSIBILITY FOR THE CONTENTS OF THEIR SUBMISSIONS TO IRLIST.