Information Retrieval List Digest 143 (December 22, 1992) URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-143 IRLIST Digest ISSN 1064-6965 December 22, 1992 Volume IX, Number 47 Issue 143 ********************************************************** I. NOTICES A. Meeting Announcements/Calls for Papers 1. ACM TIS Special Issue on Text Categorization 2. NATO Advanced Research Workshop on Burning Issues in Discourse, Maratea, Italy, April 13-15, 1993 B. Publications Announcements 1. Proceedings, Astronomy from Large Databases II. QUERIES A. Questions and Answers 1. Tools for Data Retrieval from IBM B. Requests for Information 1. Retrieval of Noisy Text Data III. JOB ANNOUNCEMENTS 1. Continuum Productions, Corp. 2. Drexel University, Information Studies IV. PROJECT WORK C. Abstracts 1. IR-Related Dissertation Abstracts ********************************************************** I. NOTICES I.A.1. Fr: David Lewis Re: Call For Papers: ACM TIS Special Issue on Text Categorization Call For Papers Special Issue on Text Categorization ACM Transactions on Information Systems Submissions due: June 1, 1993 Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Reducing an infinite set of possible natural language inputs to a small set of categories is a central strategy in computational systems that process natural language. Some uses of text categorization have been: --To assign subject categories to documents in support of text retrieval and library organization, or to aid the human assignment of such categories. --To route messages, news stories, or other continuous streams of texts to interested recipients. --As a component in natural language processing systems, to filter out nonrelevant texts and parts of texts, to route texts to category-specific processing mechanisms, or to extract limited forms of information. --As an aid in lexical analysis tasks, such as word sense disambiguation. 
--To categorize nontextual entities by textual annotations, for instance to assign people to occupational categories based on free text responses to survey questions. ACM Transactions on Information Systems is the leading forum for presenting research on text processing systems. For this special issue we encourage the submission of high quality technical descriptions of algorithms and methods for text categorization. Experiments comparing alternative methods are especially welcome, as are results on deploying systems into regular use. Five copies of each manuscript should be submitted to either of the special issue editors at the addresses below:

David D. Lewis                 Philip J. Hayes
AT&T Bell Laboratories         Carnegie Group, Inc.
600 Mountain Ave.              Five PPG Place
Room 2C409                     Pittsburgh, PA 15222
Murray Hill, NJ 07974          USA
USA                            hayes@cgi.com
lewis@research.att.com

Submission     June 1, 1993
Notification   October 1, 1993
Revision       February 1, 1994
Publication    mid-1994

The July 1990 issue of TIS contains a description of the style requirements. ********** I.A.2. Fr: Eduard Hovy Re: NATO Advanced Research Workshop, Maratea, Italy, April 13-15, 1993 NATO ADVANCED RESEARCH WORKSHOP on BURNING ISSUES IN DISCOURSE Maratea, Italy 13th - 15th April, 1993 Directors: Prof. Donia Scott (ITRI, University of Brighton) Dr. Eduard Hovy (ISI, University of Southern California) OBJECTIVES: Researchers of computational discourse are currently grappling with issues that in many cases are also being addressed, and perhaps even solved, in other subdisciplines of linguistics. The aim of this workshop is to facilitate cross-disciplinary interactions, and simply to learn from one another. The intention is not to produce a grand new theory, but rather to inform one another about the facets of the problem and available methods of addressing them. Among the issues to be discussed are: 1.
Multi-Party Discourse: The collaborative construction of a coherent discourse involves several factors that complicate the single-speaker picture. How well do current theories account for these phenomena? Can they be used in computational systems? What needs to be added, and how can the open questions be addressed in testable ways? 2. Discourse Segmentation: Coherent discourse is structured. What does this structure look like? How are the structural segments defined? What are the relevant units of segmentation? How are their boundaries signalled, and what information do the boundaries constrain? What role does communicative intentionality play in the segmentation? 3. Intersegment Relatedness: Discourse segments are related in particular ways to give structure to the discourse. What is the nature of the intersegment relations? What relations do people use, and how can suggested relations be validated? Is it possible to construct grammars of discourse using these relations? 4. Information in Discourse: Information is not presented randomly within discourse segments, and segments themselves are not randomly ordered. What governs the flow of information? What is the difference between notions such as Topic, Theme, Focus, and Given? How does information presentation (by the speaker) influence information access (of the hearer)? 5. Discourse Structure and Syntactic Form: How do discourse structure and syntactic form relate? How can one identify correlations between them and specify the correlations as rules for, say, automated discourse generation? 6. Tools, Techniques, and Experimental Methodologies: How can theories of discourse be empirically verified? All the above mentioned topics can benefit from the development and application of objective testing techniques. What techniques and methodologies exist? What aspects of discourse do they best address? The total number of participants will be limited to about 50. PUBLICATION: The proceedings of this workshop will be published in the NATO ASI series.
FEE: There is no registration fee for members of academic institutions and a nominal fee of LIT 100,000 ($75, £50) for other participants. APPLICATION: Due to the nature of the workshop, only a limited number of participants can be accommodated. Interested participants should send a short vita, mentioning their present nationality, and a short statement of (a) their approach to and perspectives on each of the discussion issues outlined above and (b) which among these is the most burning issue(s) for them. A deposit of £100 will be required, issued as a cheque (in pounds sterling) payable to "NATO ARW". The deposit is returnable to non-accepted applicants. Participants must stay for the entire period of the workshop. Closing date for applications is 31 December 1992. No special application form is required. Successful applicants will be informed by 18 January 1993. Applications and requests for further information should be directed to: Dereen Taylor, Research Administrator, IT Research Institute, University of Brighton, Lewes Road, Brighton, BN2 4AT tel: (+44 -273) 642900 fax: (+44 -273) 606653 email: burning.issues@itri.bton.ac.uk ********** I.B.1. Fr: F. Murtagh Re: Astronomy from Large Databases II, Proceedings Astronomy From Large Databases II: The proceedings of this Workshop, held at Haguenau, France, from September 14-16, have just become available. The 534-page volume is edited by A. Heck and F. Murtagh. Address questions about price and purchase arrangements to the attention of ESO, Financial Services, Karl-Schwarzschild-Str. 2, Garching/Muenchen, Germany. ********************************************************** II. QUERIES II.A.1. Fr: Liu Zi-Di Re: Tools for Data Retrieval from IBM Dear Netters, I'll have to build a simple data retrieval application on VM/CMS, which provides a menu-driven interface to end users. I don't think we need to install complex SQL/DS. Does anybody know of better tools or program products from IBM HESC that can perform this function?
Any suggestion will be very much appreciated. Regards, --Liu Zi-Di ********** II.B.1. Fr: David Lewis Re: Retrieval of Noisy Text Data Hi -- I'm interested in seeing references on text retrieval and text categorization with noisy or distorted text data. Examples of such data would be: output from speech recognition systems, output from optical character recognition, keypunched data with lots of typographical errors, and data corrupted by line noise. Please send references to me and I will post a summary to this list. thanks, Dave David D. Lewis email: lewis@research.att.com AT&T Bell Laboratories ph. 908-582-3976 600 Mountain Ave.; Room 2C409 Murray Hill, NJ 07974; USA ********************************************************** III. JOB ANNOUNCEMENTS III.1 From: Kody Janney (Continuum) Subject: IR-list submission Note: For those who have seen this job announcement before, Continuum Productions was known as Interactive Home Systems (IHS) until last month. Continuum Productions, Corp., an equal opportunity employer, has positions open for bright, innovative people with experience in any of the following areas: * Computer databases with emphasis in retrieval concepts * Image management and retrieval * Thesaurus construction and maintenance * Image classification system development and maintenance * End-user searching issues * Information retrieval Continuum Productions is building a comprehensive visual information library, applying state of the art digital technologies and advanced software to enhance the display and enjoyment of a broad spectrum of subjects. This material will be made available to a wide audience through existing and emerging technologies. If you are interested in putting theory into practice in a dynamic, creative, fast-growing start-up please send your resume to: Information Manager Continuum Productions, Corp. 
15395 SE 30th Place Suite 300 Bellevue WA 98007 ********** III.2 Fr: Kate McCain Re: Faculty Position, College of Information Studies, Drexel U. ORGANIZATION OF KNOWLEDGE Drexel University's College of Information Studies seeks applicants for a full-time tenure-track position in the broad area of the organization of knowledge. It is anticipated that the position will be filled at the assistant or associate level by a person with an appropriate earned or nearly earned PhD. The candidate will have responsibility for teaching, research and development in some combination of the following areas: * classification theory and knowledge representation * subject access systems design * organization of information for access and retrieval * bibliographic cataloging and classification * cognitive aspects of information-seeking behavior * text analysis for information retrieval The College, which offers a BS and MS in information systems, an ALA-accredited MS in library and information science, and a PhD, recognizes the pivotal role that information plays in society. Our view of information is broad, multidisciplinary, and practical, and we are committed to reinventing education for information professionals. The successful candidate will participate in all teaching programs and will be expected to lead research and development activities in these domains. He or she will be responsible for oversight and some teaching of graduate courses in the organization of knowledge. Please submit a letter of application, curriculum vitae, and names, addresses, and phone numbers of at least three references to Katherine W. McCain, PhD, Co-Chair, Faculty Search Committee, College of Information Studies, Drexel University, Philadelphia, PA 19104. Review of applicants will begin February 15, 1993. Applications will be considered until the position is filled. Drexel University is an equal opportunity employer. Women and minorities are encouraged to apply.
********************************************************** IV. PROJECT WORK IV.C.1. Fr: Susanne M. Humphrey Re: Selected IR-Related Dissertation Abstracts The following are citations selected by title and abstract as being related to Information Retrieval (IR), resulting from a computer search, using BRS Information Technologies, of the Dissertation Abstracts Online database produced by University Microfilms International (UMI). Included are UMI order number, title, author, degree, year, institution; number of pages, one or more Dissertation Abstracts International (DAI) subject descriptors chosen by the author, and abstract. Unless otherwise specified, paper or microform copies of dissertations may be ordered from University Microfilms International, Dissertation Copies, Post Office Box 1764, Ann Arbor, MI 48106; telephone for U.S. (except Michigan, Hawaii, Alaska): 1-800-521-3042, for Canada: 1-800-268-6090. Price lists and other ordering and shipping information are in the introduction to the published DAI. An alternate source for copies is sometimes provided. Dissertation titles and abstracts contained here are published with permission of University Microfilms International, publishers of Dissertation Abstracts International (copyright by University Microfilms International), and may not be reproduced without their prior permission. AN This item is not available from University Microfilms International ADG05-71521. AU BRENT, MICHAEL RICHARD. TI AUTOMATIC ACQUISITION OF SUBCATEGORIZATION FRAMES FROM UNRESTRICTED ENGLISH. IN Massachusetts Institute of Technology Ph.D. 1991. SO DAI V53(02), SecB, pp929. DE Computer Science. Artificial Intelligence. AB This thesis describes an implemented system that automatically acquires syntactic features of verbs, features essential for parsing, from unrestricted English text. 
Among the features it learns are those responsible for the distinct parses of the following sentences:

John told ($_{\rm NP}$ the man who mumbles) to arrive early
John doubted ($_{\rm NP}$ the man who likes to arrive early)

The lack of adequate dictionaries containing these features has been a major bottleneck for natural language processing. The system described here can learn these features without supervision from completely unrestricted English text, no matter how novel the content or vocabulary. This is a problem that has never before been attacked. A new type of natural language system is developed to solve this new problem, and the system is shown to be effective in acquiring a variety of subcategorization frames for hundreds of verbs. Acquiring subcategorization frames independent of the lexical familiarity of the input poses many challenges, including lexical and syntactic ambiguity and typographical errors in the input. These challenges are met by a system of overlapping safety-nets. The safety nets include: (1) Waiting for syntactically unambiguous examples whenever possible; (2) Filtering hypothesized verbs by morphological criteria; (3) Using statistical models to filter out occasional random errors. All of these techniques exploit the fact that evidence for each feature of each verb exists in many copies throughout the input. The system developed here draws conclusions from the aggregate evidence provided by many input sentences. This approach lends a suppleness not available to natural language systems that try to draw perfect conclusions from every input sentence. In addition to developing new techniques and lexical resources for natural language processing, this work may lead to valuable techniques in the linguistic and psychological study of the lexicon. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.). AN University Microfilms Order Number ADG92-20627.
AU CHEN, QIFAN. TI AN OBJECT-ORIENTED DATABASE SYSTEM FOR EFFICIENT INFORMATION RETRIEVAL APPLICATIONS. IN Virginia Polytechnic Institute and State University Ph.D. 1992 253 pages. SO DAI V53(02), SecB, pp930. DE Computer Science. AB This dissertation deals with the application of object-oriented database techniques to the problem of storage and access of information retrieval (IR) data, especially data that can be organized as a graph, such as a thesaurus encoded in semantic networks, or hypertext collections. Even traditional IR models can use graph representations of documents and concepts. This dissertation reports the development of an object-oriented model called the LEND (Large External object-oriented Network Database) model. This model contains not only features found in a typical object-oriented model but also those that specifically are designed for graph-structured data. A query language is provided facilitating the specification of graph-oriented queries. A prototype LEND system has been implemented to test the model on realistic graph-structured data. It adopts an open system architecture and design, and is easily extensible, like the LEND model itself. The research result of suitable data structures and algorithms (a class of minimal perfect hashing functions) for the efficient implementation of the LEND model is also reported. These data structures and algorithms enable retrieval of a node or a set of nodes in an optimal fashion. Placement of a large graph on a disk is studied as well. The method developed permits efficient traversal of graphs. ********************************************************** IRLIST Digest is distributed from the University of California, Division of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550. 
Send subscription requests to: LISTSERV@UCCVMA.BITNET Send submissions to IRLIST to: IR-L@UCCVMA.BITNET Editorial Staff: Clifford Lynch calur@uccmvsa.ucop.edu or calur@uccmvsa.bitnet Nancy Gusack ncgur@uccmvsa.bitnet Mary Engle meeur@uccmvsa.bitnet The IRLIST Archives will be set up for anonymous FTP, and the address will be announced in future issues. To access back issues presently, send the message INDEX IR-L to LISTSERV@UCCVMA.BITNET. To get a specific issue listed in the Index, send the message GET IR-L LOGYYMM, where YY is the year and MM is the numeric month in which the issue was mailed, to LISTSERV@UCCVMA (Bitnet) or LISTSERV@UCCVMA.UCOP.EDU. You will receive the issues for the entire month you have requested. These files are not to be sold or used for commercial purposes. Contact Nancy Gusack or Mary Engle for more information on IRLIST. THE OPINIONS EXPRESSED IN IRLIST DO NOT REPRESENT THOSE OF THE EDITORS OR THE UNIVERSITY OF CALIFORNIA. AUTHORS ASSUME FULL RESPONSIBILITY FOR THE CONTENTS OF THEIR SUBMISSIONS TO IRLIST.
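The back-issue request described above (GET IR-L LOGYYMM) can be composed mechanically. The following is only a minimal sketch: the function name listserv_get_command is hypothetical, and it assumes that YY means the last two digits of the year and that MM is zero-padded to two digits, which the notice implies but does not state outright. It builds the message text only; it does not send mail.

```python
def listserv_get_command(year: int, month: int, list_name: str = "IR-L") -> str:
    """Build the message text that requests one month's issues of a list.

    Assumptions (not confirmed by the digest): YY is the last two digits
    of the year, and MM is the month zero-padded to two digits.
    """
    if year < 0:
        raise ValueError("year must be non-negative")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # LOGYYMM: two-digit year followed by two-digit month.
    return f"GET {list_name} LOG{year % 100:02d}{month:02d}"


# Example: the request for the month in which this issue was mailed.
print(listserv_get_command(1992, 12))
```

The resulting text would be sent as the message to LISTSERV@UCCVMA (Bitnet) or LISTSERV@UCCVMA.UCOP.EDU, as described above.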