Information Retrieval List Digest 120 (July 13, 1992)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-120

=========================================================================
Date: Mon, 13 Jul 1992 15:25:43 PST
Reply-To: "Information Retrieval List"
Sender: "Information Retrieval List"
From: IRLIST
Subject: IR-L Digest, Vol.IX,No.24, Issue 120

IRLIST Digest    July 13, 1992    Volume IX, Number 24    Issue 120

**********************************************************

II. QUERIES
    B. Requests for Information
       1. Multi-Language User Interfaces
       2. What is the "State of the Art" in Retrieval?
       3. ICAME
    C. Miscellaneous
       1. WINDO
IV. PROJECT WORK
    C. Abstracts
       1. IR-Related Dissertation Abstracts

**********************************************************

II. QUERIES

II.B.1.
Fr: Avram Danon
Re: Multi-Language User Interfaces

We are developing a product whose "standard" language is English but which will be marketed to customers in different European countries (future plans may also include Arabic and Chinese). The user interface (and therefore the choice of language) of the product will differ from customer to customer and will be one fixed language per system.

The language-specific data will include prompts, report titles, dates (European notation vs. American notation) and warning messages. All texts have been taken out of the "C" code and put into header files so that the code is "language independent" and changes can be made in a central location.

We are looking for information from people who have developed similar applications. Specifically:

- What is their preferred way of storing and maintaining the various language texts?
- What is their preferred way of retrieving the data?
- What pitfalls must one avoid?
- (How) can the language selection be done at run time, or must it be done as part of compilation?
- ... and other useful information.

Abraham M. Danon, INTERNET: avram@mcil.comm.mot.com, POST: bcms99
PHONE: +972 (3) 565-8727, FAX: +972 (3) 565-8754, (UTC-2)

**********

II.B.2.
Fr: Patrick Jost
Re: What is the "State of the Art" in Retrieval?

I'm looking for pointers to articles, systems, people, organizations, ideas, products, whatever that represent either the state of the art or just "good ideas" in the field of information retrieval. I'm not so much interested in search hardware as in clever and effective ways to use it, such as query creation, document ranking, dealing with foreign languages and so on.

Any email would be appreciated! Thanks!

Patrick Jost
jost@coyote.trw.com/(310) 812-2759

..suenos perdidos, dolor infinito, ya sin sentido, en el triste fluir, de los llantos sumergidos, al silencio de un grito en el vacio... (J. L. Velasquez)

**********

II.B.3.
From: KROVETZ@cs.umass.EDU
Subject: ICAME (IR-L Digest, Vol.IX,No.22,Issue 118)

Hi. I was the moderator for the panel on Corpus Linguistics, and I'm pleased to see an interest in this area. I think the research being done in Corpus Linguistics is complementary to the work being done in IR, and that community uses a standard set of corpora just as standard corpora are used in IR; these corpora are available through ICAME. The organization publishes an annual journal (ICAME Journal), and the following description is taken from the first page of the latest issue:

"ICAME (the International Computer Archive of Modern English) is an organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions.
The Norwegian Computing Centre for the Humanities in Bergen, Norway, acts as a distribution centre for computerized English language corpora and corpus-related software. It publishes the ICAME Journal (previously ICAME News) and maintains an electronic information service. Conferences have been arranged since 1979."

A survey of existing corpora is available from Knut Hofland: FAFSRV@NOBERGEN.BITNET. The corpora are not available for anonymous FTP, but can be purchased for a moderate fee. For example, I think the untagged LOB corpus is available on tape for about $50. A CD-ROM containing the major collections (in both raw and indexed form) and several software packages is about $600.

The major collections are: the Brown Corpus, the LOB Corpus, the London-Lund Corpus, the Kolhapur Corpus, and the Helsinki Corpus. The Brown Corpus (from Brown University) is the oldest and contains about 1 million words of American English broken down into 500 texts of 2000 words each; it is arranged according to a number of different genres (fiction, newspaper text, scientific text, etc.). The LOB is a counterpart to the Brown Corpus, except that it is British English. The Kolhapur Corpus is Indian English, and the Helsinki Corpus is diachronic (Old English and Middle English). The London-Lund Corpus is transcribed spoken English. The CD-ROM also contains a tagged version of the LOB corpus (that is, each word of the corpus has been annotated with its part of speech).

The proceedings of previous ICAME conferences have been published in book form, and those are listed at the end of the paper in the SIGIR proceedings. ICAME also maintains a bibliography of the literature in Corpus Linguistics, and that is available through Knut Hofland as well.

I would be glad to answer any other questions about Corpus Linguistics and its relation to IR.

Bob
krovetz@cs.umass.edu

**********

II.C.1.
Fr: Stavros Macrakis
Re: WINDO (IR-L Digest, Vol.IX,No.19,Issue 115)

James Love reprinted an editorial from the (Trenton) Times which says:

"Rep. Rose's bill, HR 2772, would create something called the Wide Information Network for Data Online (WINDO) [which] would ..establish a one-stop shopping window for federal databases."

I certainly sympathize with the intent of making public information more readily accessible. But does the proposal achieve that goal?

If the argument is that the service should be available more cheaply to all citizens, giving subsidies to existing services would seem the best way to proceed (perhaps in the form of searching budgets for deposit libraries). If the argument is that current services are overcharging, I'd like to understand why entrepreneurs aren't undercutting their prices. If the barrier to entry for entrepreneurs is capital costs, do we really want to finance those costs through a further expansion of the national debt?

-s

**********************************************************

IV. PROJECT WORK

Fr: Susanne M. Humphrey
Re: Selected IR-Related Dissertation Abstracts

The following are citations selected by title and abstract as being related to Information Retrieval (IR), resulting from a computer search, using BRS Information Technologies, of the Dissertation Abstracts Online database produced by University Microfilms International (UMI). Included are UMI order number, title, author, degree, year, institution; number of pages, one or more Dissertation Abstracts International (DAI) subject descriptors chosen by the author, and abstract.

Unless otherwise specified, paper or microform copies of dissertations may be ordered from University Microfilms International, Dissertation Copies, Post Office Box 1764, Ann Arbor, MI 48106; telephone for U.S. (except Michigan, Hawaii, Alaska): 1-800-521-3042, for Canada: 1-800-268-6090. Price lists and other ordering and shipping information are in the introduction to the published DAI.
An alternate source for copies is sometimes provided.

Dissertation titles and abstracts contained here are published with permission of University Microfilms International, publishers of Dissertation Abstracts International (copyright by University Microfilms International), and may not be reproduced without their prior permission.

AN University Microfilms Order Number ADG91-25202.
AU KABALISWARAN, R.
TI PARADIGM ANALYSIS IN ORGANIZATION THEORY.
IN New York University, Graduate School of Business Administration Ph.D. 1991, 256 pages.
SO DAI V52(04), SecA, pp1429.
DE Business Administration, Management. Information Science. Sociology, Theory and Methods.
AB This study investigates the paradigmatic content of organization theory as revealed in the articles of two management journals, namely, Administrative Science Quarterly and Academy of Management Journal. A random sample of forty articles from each journal was chosen and subjected to content analysis of their meta-theoretical, methodological, and disciplinary-base dimensions. Both journals yielded three identifiable clusters of articles based on the following disciplines: sociology, political science, and psychology. Academy of Management Journal articles showed a strong psychology cluster and weak political science and sociology clusters while Administrative Science Quarterly showed a more even spread across all three disciplines. The differences between the two journals in their disciplinary affinity were further corroborated by a citation analysis of the references cited in the sample articles.

AN University Microfilms Order Number ADGDX-92934.
AU PARKES, ALAN PHILIP.
TI AN ARTIFICIAL INTELLIGENCE APPROACH TO THE CONCEPTUAL DESCRIPTION OF VIDEODISC IMAGES.
IN University of Lancaster (United Kingdom) Ph.D. 1988, 216 pages.
SO DAI V52(04), SecA, pp1114.
DE Cinema. Artificial Intelligence.
AB Available from UMI in association with The British Library.
This thesis represents an Artificial Intelligence approach to the conceptual description and computerised discussion and retrieval of videodisc still frames and moving films. The research herein assumes technological facilities of a computer-controlled videodisc player with a frame addressable retrieval facility. The research draws on cinema theory (in order to define the class of films to be considered); on structural and cognitive psychology (to attempt to ascertain the describable nature of the film and picture); and on Artificial Intelligence (conceptual graph theory, scripts and temporal logic). The ultimate aim of the research is to facilitate an "intelligent" user-responsive "interactive video" system, ultimately to be used in training and education. A simple logic based temporally-hierarchical moving film representation language is introduced, this language forming the basis of an intelligent prototype system, "CLORIS", which is also described in the thesis. The bulk of the thesis is concerned with laying down the methodological assumptions and definitions which should form the basis of attempts to describe the conceptual and visual contents of film.

AN University Microfilms Order Number ADG91-26477.
AU BELL, JOHN EDWARD.
TI A CASE STUDY OF AD HOC QUERY INTERFACES TO DATABASES.
IN University of California, Berkeley Ph.D. 1990, 207 pages.
SO DAI V52(04), SecB, pp2132.
DE Computer Science.
AB This dissertation describes the research we performed studying three different ad hoc query interfaces to databases. We wanted to determine which of the interfaces was easiest to learn and which one allowed subjects to be most productive. The interfaces we used were all commercial query languages and they were chosen to represent three different interface models: for an artificial language (i.e., formal computer language) interface we used SQL, for a graphical language interface we used Simplify, and for a natural language (i.e., English) interface we used DataTalker.
Subjects were taught how to use one of the three interfaces during a learning phase of the experiment, and they were tested on that interface during a performance phase. The subjects ranged in ability from no computer experience at all to experts on each of the interfaces studied. They were tested over a range of tasks from simple single table queries through complex joins. We analyzed the quantitative results of the performance of the subjects in terms of their ability to complete the tasks and in terms of the time they needed to complete the successful tasks. We also studied the qualitative data drawn from observations we made of the subjects during the experiment and comments made by the subjects both during and after the experiment. The results showed that though no interface was uniformly best, different groups of users did best with each interface. Expert users on each interface and subjects who had not used computers before did best when using SQL, users familiar with SQL and programmers did best when using Simplify, and computer users without programming experience did best when using DataTalker. We conclude that the graphical and natural language interfaces show promise as a better interface overall than the artificial language. Today, however, an artificial language still has the best general applicability.
*Research supported in part by a grant from BP America and the National Science Foundation under grant MIP 87-15557.

AN University Microfilms Order Number ADG91-27095.
AU CLIFTON, CHRISTOPHER WADE.
TI HYPERFILE, A DATABASE MANAGER FOR DOCUMENTS.
IN Princeton University Ph.D. 1991, 107 pages.
SO DAI V52(04), SecB, pp2134.
DE Computer Science.
AB Documents, pictures, and other such non-quantitative information pose interesting new problems in the database world.
Such data has traditionally been stored in file systems, which do not provide the security, integrity, or query features of database management systems. We have developed HyperFile, a data server that provides query facilities (as well as some other database features) while maintaining the flexibility and efficiency of a file system. HyperFile is based on the hypertext notion of free-form objects connected by links. Hypertext systems "query" their database by browsing (reading objects and following links). We present a query interface that maintains much of the flavor of browsing, allowing the user to specify a single query rather than manually following links. This eliminates the repeated user interactions of hypertext browsing, and allows the hypertext model to be extended to larger and less structured databases. An algorithm for processing HyperFile queries is presented. We also show how to extend this algorithm for distributed query processing, and present experimental results from a distributed HyperFile server. Another issue explored is indexing. In HyperFile, searches are often demarcated by pointers between items. Thus the scope of the search may change dynamically, whereas traditional indexes cover a statically defined region such as a relation. This demands new indexing techniques. Some ideas on indexing in HyperFile are presented, as well as experiments in a large HyperFile database. Also presented is a sample HyperFile application. This is a "browser" that uses menus to guide the user in constructing HyperFile queries.

AN University Microfilms Order Number ADG91-29355.
AU FRAIL, ROBERT P.
TI TEXT CLASSIFICATION IN FRAGMENTED SUBLANGUAGE DOMAINS.
IN Polytechnic University Ph.D. 1991, 118 pages.
SO DAI V52(04), SecB, pp2137.
DE Computer Science. Language, Linguistics.
AB On-line information processing systems provide rapid, easy access to highly structured repositories of information.
The recent maturing of text processing technologies (word processors, desktop publishing systems, optical character readers, document scanners) has created an explosion of on-line texts, containing much information useful for databases and other applications. However, a human is usually required to separate the useful from the useless, a tedious and costly task. Text classification systems attempt to solve this problem by automatically extracting the essential concepts of interest defined for a domain of texts. Such systems typically require a high level of linguistic and programming expertise to deploy, and are domain-specific. In this dissertation, we identify the textual attribute most responsible for accuracy and robustness in text classification systems as conceptual predictability. We then define an important new text domain called a fragmented sublanguage domain, for which current technologies are of limited or no use. We then present an innovative new technology capable of robust classification within this new text domain, yet highly applicable to domains of more "natural" language. Our major contributions include a parallel parsing technique that provides a high degree of immunity to textual noise. We develop a new grammar formalism, called Extended Regular Expression Transduction Grammar, that enables the highly perspicuous expression of concepts having high structural complexity, yet is equivalent in power to context-free grammar. We then present an innovative composite transducer built from an interconnected hierarchy of grammars called an Order-N Recursive Transducer. This multi-level concept recognition engine is applicable to the classification of texts having low conceptual predictability, as well as other text processing tasks. We also show a technique for the supervised learning of unrecognized lexemes that leverages the differences in similarly defined word classes to flag potentially new members of a syntactic category.
Finally, we demonstrate the efficacy of our theoretical models and algorithms in two ways: first, we present a grammar compiler that automatically generates executable text summarization systems, in the form of Order-N Recursive Transducers, from a hierarchy of grammars expressed in our notation. We then present three compiler-generated applications in three different domains of text.

**********************************************************

IRLIST Digest is distributed from the University of California, Division of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550.

Send subscription requests to: LISTSERV@UCCVMA.BITNET
Send submissions to IRLIST to: IR-L@UCCVMA.BITNET

Editorial Staff:
    Clifford Lynch   lynch@uccmvsa.ucop.edu or calur@uccmvsa.bitnet
    Nancy Gusack     ncgur@uccmvsa.bitnet
    Mary Engle       engle@cmsa.berkeley.edu or meeur@uccmvsa.bitnet

The IRLIST Archives will be set up for anonymous FTP, and the address will be announced in future issues. To access back issues presently, send the message INDEX IR-L to LISTSERV@UCCVMA.BITNET. To get a specific issue listed in the Index, send the message GET IR-L LOG ***, where *** is the month and day on which the issue was mailed, to LISTSERV@UCCVMA.BITNET.

These files are not to be sold or used for commercial purposes. Contact Nancy Gusack or Mary Engle for more information on IRLIST.

The opinions expressed in IRLIST do not represent those of the editors or the University of California. Authors assume full responsibility for the contents of their submissions to IRLIST.