Information Retrieval List Digest 145 (January 5, 1992)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-145

IRLIST Digest ISSN 1064-6965
January 5, 1992
Volume X, Number 1 Issue 145
**********************************************************
I. NOTICES
   A. Meeting Announcements/Calls for Papers
      1. 5th Message Understanding System Evaluation/
         Message Understanding Conference
II. QUERIES
   B. Requests for Information
      1. Encryption/Decryption on Top of Fulltext Retrieval Software
      2. Request for Information in Archives
III. JOB ANNOUNCEMENTS
   1. Researcher, Siemens Corporate Research, Princeton, New Jersey
**********************************************************
I. NOTICES

I.A.1.
Fr: Beth M. Sundheim
Re: 5th Message Understanding Conference--Call for Participation

* * * CALL FOR PARTICIPATION * * *

FIFTH MESSAGE UNDERSTANDING SYSTEM EVALUATION
AND MESSAGE UNDERSTANDING CONFERENCE (MUC-5)

1 MARCH - 27 AUGUST, 1993

Preparation: 1 March - 23 May
             29 May - 25 July
Evaluations: 24-28 May (dry run)
             26-30 July (formal run)
Conference:  25-27 August

Sponsored by:
Defense Advanced Research Projects Agency
Software and Intelligent Systems Technology Office (DARPA/SISTO)

The Message Understanding Conferences have provided an ongoing forum for assessing the state of the art and practice in text analysis technology and for exchanging information on innovative computational techniques. They have also encouraged experimentation in the context of fully implemented systems that perform the realistic task of extracting factual information from free text. The first two conferences focused on short naval messages; the two most recent conferences challenged the systems with longer and stylistically varied terrorism news stories. The four conferences have seen the application of a wide variety of approaches to the information extraction task.

ATTENDANCE AT THE CONFERENCE IS LIMITED TO EVALUATION PARTICIPANTS AND TO GUESTS INVITED BY DARPA.
A conference proceedings, including all test results, will be published.

Modest amounts of financial support will be made available to selected participants in an effort to maximize the number of participants and to attract the widest possible variety of technical approaches and system architectures. This funding is intended only as a supplement to other support. Both U.S. and non-U.S. participants are eligible for this funding.

SCHEDULE:

3 January 1993     Deadline for applications that include funding requests (PAST)
15 January 1993    Final application deadline (no funding requests)
1 February 1993    Notification of acceptance and funding
1 March 1993       Release of system development corpus and evaluation software
24-28 May 1993     Performance evaluation (dry run) on test corpus
26-30 July 1993    Performance evaluation (formal run) on new test corpus
25-27 August 1993  Fifth Message Understanding Conference

DATA AND TASK DESCRIPTION:

Subject to successful completion of negotiations to obtain proper permissions concerning the data, the data and task to be used for MUC-5 will be the same as those already in use for the data extraction portion of the DARPA/SISTO TIPSTER Text program. There are two languages, English and Japanese, and two domains, joint ventures and microelectronic chip fabrication. These form four separate corpora. The texts are newswire articles selected to produce the desired mix of relevant and nonrelevant texts, and they were blindly divided into pools of development (training) and test data. The task is to extract information about the nature and status of activities in the domain, the entities involved, etc.

Analysts have been doing software-assisted manual generation of the "key" templates against which the system-generated templates will be evaluated. The template design is object oriented, and each slot in the template has its own fill specifications for data type, valency, etc.
The fill specifications in each domain vary slightly between English and Japanese, reflecting differences in language usage; however, the general design of the template is the same for both languages.

An English and a Japanese sample text and corresponding template in the joint ventures domain are available from the program chair (address at end of this announcement). Please specify which language(s) you are interested in. A microelectronics example may be available shortly.

The total amount of data that will be available in March to support system development is expected to be between 200 and 1,000 templates and corresponding texts. This number will vary according to the corpus and the data rights that are obtained. To receive the data, participants will be required to acknowledge its copyright status by signing agreements to safeguard the data and to use it for research purposes only.

TEST PROTOCOL AND EVALUATION CRITERIA:

MUC-5 participants may elect to do either language or both languages; they are limited to selecting just one domain. Participants will have access to TIPSTER Government-Furnished Information and shared resources such as the training texts and templates, task documentation, gazetteers, and evaluation software. TIPSTER data extraction contractors will be participating in MUC-5, for which previously unseen test data will be used. Each test set will consist of 100-300 texts, depending on language and domain.

A dry-run test will be conducted about three months after the release of the training data; the formal test will be conducted about two and one-half months after the dry run. Each test will be carried out by the participants at their own sites in accordance with a prepared test procedure and the results submitted to NRaD for official scoring by domain analysts.

Systems will be evaluated using the criteria applied to the TIPSTER Text data extraction systems.
These criteria, which are still under development, are likely to use the scoring categories (correct, partially correct, incorrect, spurious, missing, and noncommittal) to support not only the measures used for MUC-4 (recall, precision, overgeneration, fallout, and F-measure) but also new measures (probability of detection, probability of false alarm, and a measure that combines them). MUC-5 participants will be able to familiarize themselves with the evaluation criteria through usage of the evaluation software, which will be released along with the training data.

INSTRUCTIONS FOR RESPONDING TO THE CALL FOR PARTICIPATION:

Organizations within and outside the U.S. are invited to respond to this call for participation. Minimal requirements include development before the dry-run test of a system that can accept texts without manual preprocessing, process them without human intervention, and output templates in the expected format. Organizations should plan on allocating at least three person-months of effort for participation in the evaluation and conference; a substantially greater level of effort is likely to be needed in order to achieve relatively high performance. It is understood that organizations will vary with respect to experience with information extraction, domain expertise/engineering, resources, contractual demands/expectations, etc. Recognition of such factors will be made in any analyses of the results.

Organizations wishing to participate in the evaluation and conference must respond by submitting a summary of their text analysis approach and a system architecture description, not to exceed five pages in total. The summary should include the strengths of the approach and highlight its innovative aspects. Acceptance or rejection of each application will be determined on the basis of a technical assessment by the program committee. The body of the application will serve as the basis for an article in the conference proceedings.
Participants will have the opportunity to make revisions prior to publication.

The application must also include the following information:

1. Domain (choose only one)
   a. Joint ventures
   b. Microelectronics
2. Language (choose one or two)
   a. English
   b. Japanese
3. An estimate of the degree of coverage and/or length of time under development of existing software to be applied to the MUC-5 task in the selected language(s) and domain.
4. Primary point of contact for notification of acceptance/rejection of application. Please include name, surface and email addresses, and phone and fax numbers.

Those organizations wishing to request funding to supplement their own resources must provide a second statement, not to exceed two pages. This statement should include an estimate of the amount of funding available from other sources to support participation in this work and a specification of the amount of funding desired and the minimal acceptable amount. In addition, it should describe any software to be used for MUC-5 that the organization is willing to deliver to NRaD and MUC participants for possible redistribution. Please indicate clearly whether the organization is interested in participating in MUC-5 even if no funding is available. Evaluators of funding requests will not include any MUC system developers.

THE DEADLINE FOR OTHER RESPONSES IS JANUARY 15, 1993.

All participants are expected to have Internet access and to be able to do electronic file transfer via anonymous FTP. All responses should be submitted to the program chair via email to sundheim@nosc.mil. If Internet access is currently unavailable, responses may be sent via surface mail to Beth Sundheim, NCCOSC/NRaD, Code 444, San Diego, CA 92152-5000, and if a quick reply to questions is needed, the program chair may be reached by phone at 619/553-4145.

REFERENCE:

_Proceedings_of_the_Fourth_Message_Understanding_Conference_ (MUC-4)_, Morgan Kaufmann, June, 1992.
To order, call (800)745-7323 (toll free in North America) or (415)578-9928 (direct), send fax to (415)578-0672 or email to morgan@unix.sri.com. Please refer to ISBN 1-55860-273-9.

**********************************************************
II. QUERIES

II.B.1.
Fr: Chaim Manaster
Re: Encryption/Decryption on Top of Fulltext Retrieval Software

I need to locate people who have, or are working on, encryption software that will work on top of text retrieval software of the type usually found on CDROM databases. While I am not a technical sophisticate, let me attempt to elaborate to clarify my needs.

As I understand things, typically the retrieval software will first create an inverted index file of the textual database, and then when the user inputs his search terms the retrieval engine will quickly search the index, obtain the associated pointer to the original text in the database, and retrieve that text.

When encryption is introduced into the picture on top of the above retrieval software (without modifying the retrieval software), it would seem that, in order that the user may enter his query terms in an unencoded fashion, the encryption software must first encrypt his search terms, then enter the encrypted** inverted index, locate the correct entry, decrypt it, locate the associated encrypted text, decrypt it, and present that to the end user, ALL DONE ON THE FLY.

** The inverted index must be encrypted as well, since it is fairly easy to rebuild the original text solely from the unencrypted inverted index files.

Thus it seems to me that any encryption method must be able to use a decryption method that can start to decrypt at any random point in the encrypted file (or a large number of points as an approximation) to pick out some small portion of either the encrypted index or text files without the need to decrypt the entire file, which is usually huge, just to get at a search term in the index or the retrieved textual paragraph from the large database.
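
The random-access requirement described above can be illustrated with a keystream that is indexed by byte position, in the style of counter mode: each keystream block depends only on the key and the block number, so any slice of a file can be decrypted without touching the rest. The Python sketch below is an illustration under stated assumptions, not a vetted cipher and not software mentioned in this query. The names keystream_block and xor_range, the use of SHA-256 as the keystream generator, and the sample key are all hypothetical choices made for the example.

```python
import hashlib

BLOCK = 32  # SHA-256 digest size in bytes


def keystream_block(key: bytes, index: int) -> bytes:
    """Return keystream block number `index` (hypothetical construction)."""
    # The block depends only on the key and the counter, never on
    # neighbouring data, which is what makes random access possible.
    return hashlib.sha256(key + index.to_bytes(8, "big")).digest()


def xor_range(key: bytes, data: bytes, offset: int) -> bytes:
    """Encrypt or decrypt `data`, which sits at byte `offset` of the file.

    XOR is its own inverse, so the same call works in both directions.
    Only the keystream blocks overlapping the requested range are computed.
    """
    out = bytearray()
    pos = offset
    end = offset + len(data)
    while pos < end:
        index, skip = divmod(pos, BLOCK)
        block = keystream_block(key, index)
        take = min(BLOCK - skip, end - pos)
        chunk = data[pos - offset:pos - offset + take]
        out.extend(b ^ k for b, k in zip(chunk, block[skip:skip + take]))
        pos += take
    return bytes(out)


if __name__ == "__main__":
    key = b"demo key, not a real secret"            # assumption: sample key
    plaintext = b"one paragraph of a large text database. " * 50
    ciphertext = xor_range(key, plaintext, 0)       # encrypt the whole file once

    # Later: decrypt only bytes 1000..1039 without reading anything else.
    piece = xor_range(key, ciphertext[1000:1040], 1000)
    assert piece == plaintext[1000:1040]
    print(piece)
```

A real deployment would use a reviewed cipher and a distinct key or nonce per file. The sketch provides no integrity protection, and how to match encrypted search terms against an encrypted index is a separate question that it does not address.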
Would such an encryption scheme, of necessity, merely be some form of substitution cypher and therefore not worth writing (too easy to break)? What kind of encryption schemes would be worth considering, and is there any software out there (shareware, public domain or commercial) that will do the trick, or is anyone working on a similar project at the moment?

One of my main concerns is that the encryption be transparent to the retrieval engine (or at the very least require minor modification if impossible otherwise).

Please Email responses; I am not sure if I can access relevant newsgroups with my site's newsfeed. If I posted this to groups of marginal relevance, forgive me, but please at the very least suggest the appropriate newsgroups I should post to that don't require special access.

P.S. I cross-posted this to sci.crypt comp.compression comp.compression.research

Thank you in advance for the help.

Henry Manaster

Henry Manaster          *  EMail: manaster@yu1.yu.edu
Brooklyn, NY            *
                        *
Disclaimer: The above is not necessarily MY opinion nor that of anyone else :-) ????!

**********

II.B.2.
Fr: Ed Haupt
Re: Request for Information in Archives

For archives, I am seeking addresses, preferably e-mail, for this archive.

Records for G.E. Mueller, who was Ordinarius for Philosophie for 1 year (1880-1881) at Tchernowsky (Cernauti, Chernovtsy, Austro-Hungarian name, Czernowitz), now part of the Ukraine, but it was Austria-Hungary at that time, in particular the independent area of Bukovina.

Where are the archives? Are they in Wien (because the central government took them back in 1918?), or Budapest (because Bukovina was then considered part of Hungary?), or Bucharest (because Cernauti was part of Romania from 1918 to 1944?), or Kiev (because after 1944, it was part of the Ukraine?), or Moscow (because everything was moved to Moscow), or just simply in Chernovtsy?

Please reply to haupt@pilot.njin.net since I am not a member of this group.

Thanks in advance...

Edward J. Haupt
snail:                               voice:    1(201) 893-4327
Department of Psychology             internet: haupt@pilot.njin.net
Montclair State                      bitnet:   haupt@njin
1 Normal Ave.                        fax:      1(201) 893-5455
Upper Montclair, NJ 07043-1624 USA
d 1/16

**********************************************************
III. JOB ANNOUNCEMENTS

III.1.
Fr: Ellen Voorhees
Re: Job announcement

Siemens Corporate Research in Princeton, New Jersey is looking to hire an additional researcher for its information retrieval project in the Learning Systems Department. The position requires either a PhD in computer science (information retrieval, knowledge representation, etc.), computational linguistics, or a similar field (preferred) or a master's degree with some experience in a related field.

The main responsibility of the successful candidate will be to conduct research in automatic information retrieval and (statistical) natural language processing. Tasks include setting up and running experiments, programming, etc.

People interested in the position should send a PLAIN ASCII resume to ellen@learning.siemens.com or a hardcopy of the resume to:

Human Services Department EV
Siemens Corporate Research, Inc.
755 College Road East
Princeton, NJ 08540

Siemens is an equal opportunity employer.

**********************************************************
IRLIST Digest is distributed from the University of California, Division of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550.

Send subscription requests to: LISTSERV@UCCVMA.BITNET
Send submissions to IRLIST to: IR-L@UCCVMA.BITNET

Editorial Staff:
Clifford Lynch  calur@uccmvsa.ucop.edu or calur@uccmvsa.bitnet
Nancy Gusack    ncgur@uccmvsa.bitnet
Mary Engle      meeur@uccmvsa.bitnet

The IRLIST Archives will be set up for anonymous FTP, and the address will be announced in future issues.

To access back issues presently, send the message INDEX IR-L to LISTSERV@UCCVMA.BITNET.
To get a specific issue listed in the Index, send the message GET IR-L LOGYYMM, where YY is the year and MM is the numeric month in which the issue was mailed, to LISTSERV@UCCVMA (Bitnet) or LISTSERV@UCCVMA.UCOP.EDU. You will receive the issues for the entire month you have requested.

These files are not to be sold or used for commercial purposes.

Contact Nancy Gusack or Mary Engle for more information on IRLIST.

THE OPINIONS EXPRESSED IN IRLIST DO NOT REPRESENT THOSE OF THE EDITORS OR THE UNIVERSITY OF CALIFORNIA. AUTHORS ASSUME FULL RESPONSIBILITY FOR THE CONTENTS OF THEIR SUBMISSIONS TO IRLIST.