Information Retrieval List Digest 010 (February 14, 1990)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-010

IRLIST Digest           February 14, 1990
Volume VII Number 4     Issue 10

***************************************************************
Continued from Volume VII Number 4, Issue 9
***************************************************************

IV. PROJECTS: Initiatives and proposals / Bibliographies
    Abstracts / Miscellaneous
    D.4. Survey Corpora (continued)

***************************************************************

IV.D.4. (continued)
Fr: FAFSRV%NOBERGEN.BITNET@CUNYVM.CUNY.EDU
Re: SURVEY CORPORA

---------------------------------------------------------------------
MELBOURNE-SURREY CORPUS
---------------------------------------------------------------------
Compiled by:          G. G. Corbett, Khurshid Ahmad
Compiled at:          Department of Linguistic and International
                      Studies, and Computing Unit, University of
                      Surrey, Guildford, Surrey GU2 5XH.
Date of compilation:
Sampling period:      1980-81
Language (variety):   Australian English
Spoken/written:       Written
Size:                 c. 100,000 words
Details of material:  Taken from the newspaper "The Age" published
                      in Melbourne. The texts are all editorials
                      which appeared from Sept. 1, 1980 to
                      Jan. 30, 1981.
Organisation:         Stored in 93 separate files. Each file
                      consists of two editorials selected on the
                      same day.
How transcribed:      Ordinary written text.
How analysed:
Use of corpus:        Of value to those working on varieties of
                      English, and should complement the work being
                      done on spoken Australian English.
Availability:         Distributed through ICAME. Available for
                      research purposes.
Other:                Material is all in uppercase, but upper/lower
                      case information is available in the
                      originals which are lodged with ICAME.

---------------------------------------------------------------------
NIJMEGEN CORPUS
---------------------------------------------------------------------
Compiled by:          J Aarts and others
Compiled at:          University of Nijmegen
Date of compilation:
Language (variety):   British English
Spoken/written:       Written
Size:                 1.5 million words
Sampling period:      Post 1975
Details of material:  Corpus consists of material that was "written
                      to be read", i.e. no samples of poetry, plays,
                      speeches, etc which are meant to be spoken.
                      All texts in educated British English (no
                      varieties of English or non-educated English
                      allowed).
Organisation:         Divided into the following categories:

                      NON-FICTION
                      I Arts
                        NAUT  autobiography/biography
                        NEDU  education
                        NHIS  history
                        NLIN  language and linguistics
                        NLIT  literary criticism
                        NPHI  philosophy
                        NPSY  psychology and psychiatry
                        NSOC  sociology and anthropology
                        NWOM  women's studies
                      II Sciences
                        NBIO  biology
                        NCHE  chemistry
                        NECO  economics
                        NGEO  geography
                        NMED  health and medicine
                        NPHY  physics
                      III Miscellaneous
                        NGEN  non-fiction, general
                        NLAW  law and government
                        NMYS  mysticism and the occult
                        NPOL  politics
                        NREL  religion and mythology
                        NTRA  travel

                      FICTION
                        FCRI  crime and mystery
                        FHOR  horror
                        FHUM  humour
                        FNOV  general fiction, novel
                        FPSY  psychological novel
                        FROM  love and romance
                        FSFF  science fiction and fantasy
                        FSTO  general fiction, short story
                        FTHR  thriller and adventure
How transcribed:      Ordinary written text, but with additional
                      codes modelled on the LOB coding system, to
                      preserve printed features.
How analysed:
Use of corpus:        Study of linguistic variation
Availability:
Storage details:
Other:

---------------------------------------------------------------------
PoW CORPUS (Polytechnic of Wales Corpus)
---------------------------------------------------------------------
Compiled by:          Robin P. Fawcett & Michael R.
                      Perkins
Compiled at:          Polytechnic of Wales
Date of compilation:  1978 - 1984
Language (variety):   English - children's
Spoken/written:       Spoken
Size:                 Approximately 100,000 words, 11,396 lines.
Details of material:  Data comprises children's speech from
                      Pontypridd, South Wales. Informal register.
                      Children were screened to exclude those with
                      strong second language influence (Welsh or
                      otherwise). 120 children were involved, aged
                      between 6 and 12, divided equally according
                      to sex, age, socio-economic class established
                      by profession, and highest educational level
                      of parents. Children were recorded whilst at
                      play, and were also interviewed by a
                      "friendly" adult.
Organisation:         194 files, each with a reference to age,
                      social class, sex, play session or interview,
                      and child's initials. Each file is a sample
                      of a single child's speech in a play session
                      or interview.
How transcribed:      Recordings transcribed using conventions from
                      Survey of Modern English Usage at University
                      College, London, and those of a similar
                      project at Bristol, with pitch movements
                      marked by a trained phonetician to produce a
                      hard-copy version. Machine-readable version
                      contains no prosodic information.
How analysed:         Fully hand parsed, using a Systemic Functional
                      Grammar developed by Fawcett to include
                      Functional and Formal Syntactic categories,
                      capable of handling raising, dummy subject
                      clauses, ellipsis, replacement strings.
                      Parse trees stored in a numerical format (not
                      standard bracketed) to capture
                      discontinuities in syntactic structures.
Use of corpus:        Psycholinguistic research into development of
                      children's English between ages of 6 and 12.
                      Growth of a variety of syntactico-semantic
                      structures.
                      Current research: COMMUNAL project; Natural
                      Language Processing at UWCC and Leeds
                      University extracting machine-readable
                      systemic functional grammars and lexicons for
                      use in parsing.
Availability:         Only parsed version of corpus available in
                      machine-readable form; the recorded tapes and
                      4-volume transcripts with intonation contours
                      are available in hard copy. Can be obtained
                      (at cost of the materials only) from:
                        Robin P. Fawcett
                        Department of Behavioral and Communication
                        Studies
                        Polytechnic of Wales
                        Treforest
                        Cardiff CF37 1DL
                        U.K.
Storage details:      VMS Backup or TAR. Data has 1 sentence per
                      line, hence some very long lines. Also
                      available in 80 chars wrap round format.
                      Requires 1 Mb storage.

---------------------------------------------------------------------
SEC CORPUS (Lancaster Spoken English Corpus)
---------------------------------------------------------------------
Compiled by:          UCREL & IBM UK Scientific Centre, Speech
                      Group.
Compiled at:          University of Lancaster, IBM Ltd, Winchester.
Date of compilation:  1984 - 1987
Language (variety):   British English
Spoken/written:       Spoken
Size:                 Approximately 52,000 words.
Details of material:  Samples taken from BBC Radio broadcasts,
                      recordings made at University of Lancaster,
                      Open University tapes. Speakers have accents
                      as close to RP (standard British English) as
                      possible, and are all adults. No information
                      included on social class, education, etc of
                      speakers.
Organisation:         Divided into categories:
                        A  Commentary
                        B  News broadcast
                        C  Lecture - type I aimed at general
                           audience
                        D  Lecture - type II aimed at restricted
                           audience
                        E  Religious broadcast
                        F  Magazine-style reporting
                        G  Fiction
                        H  Poetry
                        J  Dialogue
                        K  Propaganda
                        M  Miscellaneous
                      52 texts spread over the 11 categories.
How transcribed:      Orthographically, prosodically, and without
                      any notation at all - i.e. unpunctuated
                      running text.
How analysed:         Word-tagged using CLAWS2 tagging system.
                      Manually parsed using "skeleton" parsing
                      system.
Use of corpus:        Speech synthesis project in collaboration
                      with IBM UK Scientific Centre Speech Group.
Availability:         Distributed through ICAME.
Storage details:      See ICAME distribution entry.
Other:                Prosodic transcriptions contain a set of
                      non-standard characters to represent the
                      prosodic marks.

---------------------------------------------------------------------
SURVEY OF ENGLISH USAGE (Category I texts)
---------------------------------------------------------------------
Compiled by:          R. Quirk
                      Computerization by S. Greenbaum & G. Kaye
Compiled at:          University College, London
Date of compilation:  Started in 1959
Language (variety):   British English
Spoken/written:       Written
Size:                 Approximately 500,000 words
Details of material:  Category "I" of the overall Survey of English
                      Usage Corpus.

                      I Material with origin in writing (100 texts)
                        A Printed
                            Learned arts
                            Learned sciences
                            Instructional
                            Press
                            Administrative
                            Legal
                            Persuasive writing
                            Prose fiction
                        B Non-printed
                            Continuous writing
                            Letters - social
                            Letters - non-social
                            Personal journals
                        C As spoken
                            Drama
                            Formal scripted oration
                            Broadcast news
                            Talks
                            Stories
How transcribed:      Ordinary written text.
Use of corpus:        The study of written British English.
Availability:         Not available.

---------------------------------------------------------------------
SUSANNE CORPUS - under development
---------------------------------------------------------------------
Compiled by:          G. Sampson
Compiled at:          University of Leeds
Date of compilation:
Language (variety):   American English
Spoken/written:       Written
Size:                 128,000 words
Details of material:  The SUSANNE project aims to turn the
                      Gothenburg corpus into a more accessible and
                      useful research resource by replacing its
                      existing coding with a more transparent and
                      unambiguous notation, eliminating
                      inconsistencies and errors, and incorporating
                      various categories of additional information.
                      Full orthographic details of the original
                      texts will be restored.
Organisation:         See entry for the Gothenburg corpus
How transcribed:      Full orthographic transcription incorporating
                      punctuation.
How analysed:         The Gothenburg wordtags will be replaced with
                      a more detailed tagset.
                      More complete information about underlying
                      grammatical structure will be included.
Use of corpus:        Syntactic study.
Availability:
Storage details:
Other:                SUSANNE = Surface and Underlying Structural
                      Analyses of Naturalistic English.

---------------------------------------------------------------------
WARWICK CORPUS
---------------------------------------------------------------------
Compiled by:          J. M. Gill
                      Reformatted by OUCS
Compiled at:          University of Warwick
Date of compilation:  1976 - 78
Language (variety):   British English
Spoken/written:       Written
Size:                 Over 2.5 million words
Details of material:  A miscellaneous collection of short letters,
                      minutes, lists, course notes, general and
                      children's fiction, instruction books, etc.
Organisation:         These categories were assigned by Catherine
                      Griffin of OUCS in 1980.
                        AO  Press, Newsletters, bulletins
                        AC  Bank notices (not statements)
                        DO  Religion
                        DA  Data samples, test data
                        EO  Homecraft, cooking, knitting
                        EC  Educational courses, conference progs
                        ED  Education-course lectures, generally
                            written lectures which are not articles.
                        FO  Popular lore (vasectomy, goat-keeping,
                            guide-dog training, place information)
                        GO  Biography, essays, monographs, reviews
                        HA  Government: meetings, papers
                        HB  Associations, clubs-minutes, etc
                        HC  University: minutes, meetings
                        IL  Informal spoken speech: shows, radio
                        KO  General fiction
                        KC  Children's fiction
                        LA  Law: not lessons therein - reports,
                            cases, contracts
                        LE  Spoken formal speech - lectures
                        LI  Lists: people, firms, words, menus
                        LO  Letters - official
                        LP  Letters - personal
                        PU  Puzzles
                        QU  Questionnaires
                        SO  Promotional material
                        SC  Non-textual: schedules, accounts,
                            dialling codes
                        SR  Record sleeves (not just a list of
                            sides, but commentary)
                        TO  Instructions: for machines, medicine,
                            in case of fire
                        TG  General information: catch-all:
                            anything which might go on a bulletin
                            board
                        UO  Lessons, exams
                        VO  Bibliography
How transcribed:      Completely in upper-case letters.
How analysed:         Not analysed
Use of corpus:        Intended for use in research project concerned
                      with the automatic generation of Braille by
                      computer.
Availability:         Re-formatted version distributed through
                      Oxford Text Archive

*********************************************************************
2. NON-ENGLISH MACHINE-READABLE CORPORA

---------------------------------------------------------------------
A LANGUAGE BANK OF MODERN SWEDISH - An ongoing collection of data
---------------------------------------------------------------------
Compiled by:          The Language Bank
Compiled at:          Goteborgs Universitet
                      Sprakdata
                      Department of Computational Linguistics
Date of compilation:  Started in 1975
Language (variety):   Modern Swedish
Spoken/written:       Spoken and written
Size:                 Approximately 30 million words to date.
Details of material:  The Language Bank consists of samples of:
                        Fiction
                        Legal texts
                        Proceedings of the Swedish parliament
                        Daily newspapers
                        Weekly magazines
                        Dictionary material
                        Frequency dictionary of present-day Swedish
                        Swedish Academy Glossary (Tenth edition)
Organisation:         The material is split into the "word bank",
                      i.e. material for use in lexical research
                      such as the dictionaries, glossary, etc. and
                      the "language bank", which comprises
                      authentic texts and spoken language.
How analysed:         Concordances have been produced for:
                        Press 65 - 1 million words of text from
                          morning newspapers
                        Press 76 - 1.3 million words of text from
                          morning newspapers
                        Parliament debates - 4 million words
                        Novels 1976 - 77 - 5.6 million words from
                          69 novels published c. 1976.
                        Novels 1981 - 3.7 million words from 60
                          novels published c. 1981.
                        Legal Language - 500,000 words from
                          1978-81.
                      Vocabulary lists for the above are also
                      available.
Availability:         The material of the Language bank is
                      available for non-commercial purposes. The
                      user of processed material is required to
                      sign a standard agreement with the Language
                      Bank.
                      Contact Martin Gellerstam or Christian
                      Sjogreen, at Goteborgs Universitet Sprakdata,
                      Department of Computational Linguistics,
                      Sprakdata, S-412 98 Goteborg, Sweden.

---------------------------------------------------------------------
BONNER ZEITUNGSKORPUS TEIL 1 (BZK)
---------------------------------------------------------------------
Compiled by:          Institut fur Deutsche Sprache (IDS)
Compiled at:          IDS
Date of compilation:
Sampling period:      1949, 1954, 1959, 1964, 1969, 1974
Language (variety):   German
Spoken/written:       Written
Size:                 c. 3 million words
Details of material:  A representative selection of samples from
                      the newspapers "Neues Deutschland" (East
                      Germany), and "Die Welt" (West Germany)
Organisation:
How transcribed:
How analysed:         Word lists and frequency lists have been
                      produced.
Use of corpus:
Availability:         Available from IDS.
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      Stored in EBCDIC format

---------------------------------------------------------------------
CORPUS OF PORTUGUESE - under development
---------------------------------------------------------------------
Compiled by:          Maria Tereza Camargo Biderman
Compiled at:          Institute of Arts, Social Sciences and
                      Education, Araraquara, Brazil
Date of compilation:
Language (variety):   Brazilian Portuguese, Portuguese Portuguese,
                      and African Portuguese
Spoken/written:       Written
Size:                 3 million words of Brazilian Portuguese
                      1 million words of Portuguese Portuguese
                      1 million words of African Portuguese
Details of material:  Texts will be taken from:
                        Novels
                        Plays
                        Journalism
                        Technical & scientific literature
Organisation:
How transcribed:
How analysed:         Word frequency and concordance analyses will
                      be performed.
Use of corpus:        To obtain a word frequency dictionary of the
                      modern Portuguese language.
Availability:
Storage details:
Other:

---------------------------------------------------------------------
DIALOGSTRUKTURENKORPUS (DSK)
---------------------------------------------------------------------
Compiled by:
Compiled at:
Date of compilation:
Language (variety):   German
Spoken/written:       Spoken
Size:                 c. 200,000 words
Details of material:  Identical in parts with the Freiburger corpus
                      - see that entry.
Organisation:
How transcribed:      All in lower case
How analysed:
Use of corpus:
Availability:         Distributed through IDS.
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      Magnetic tape.
Other:

---------------------------------------------------------------------
FREIBURGER CORPUS (FK)
---------------------------------------------------------------------
Compiled by:          Institut fur Deutsche Sprache (IDS)
Compiled at:          IDS
Date of compilation:
Sampling period:      1968 - 74
Language (variety):   German
Spoken/written:       Spoken
Size:                 0.5 million words
Details of material:  224 texts/documents
Organisation:         Covers the following topics:
                        Discussions
                        Interviews
                        Speeches
                        Reports
                        Narrations
                        Documentary
How transcribed:
How analysed:         Word lists and frequency lists have been
                      produced. KWIC concordances are also
                      available.
Use of corpus:
Availability:         Available from IDS.
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      Provided in EBCDIC

---------------------------------------------------------------------
HANDBUCHKORPORA H85, H86, H87
---------------------------------------------------------------------
Compiled by:
Compiled at:
Date of compilation:
Language (variety):   German
Spoken/written:       Written
Size:                 c.
                      7 million words, but due to be extended
Details of material:  Consists of articles on different topics from
                      the daily "Mannheimer Morgen" (morning
                      paper), and the weekly paper "Die Zeit" from
                      the years 1985, 1986, and 1987 (hence the
                      division into three corpora H85, H86, and
                      H87).
Organisation:
How transcribed:
How analysed:
Use of corpus:
Availability:         Available from IDS for purchase or rent
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      Magnetic tape.
Other:                Corpora to be extended with articles from
                      other publications as well as extracts from
                      other kinds of texts.

---------------------------------------------------------------------
LIMAS CORPUS
---------------------------------------------------------------------
Compiled by:
Compiled at:          Bonn and Regensburg
Date of compilation:
Language (variety):   German
Spoken/written:       Written
Size:                 c. 1 million words
Details of material:  500 text extracts of c. 2000 words from 33
                      subject areas.
Organisation:
How transcribed:
How analysed:
Use of corpus:
Availability:         Institut fur Phonetik und
                      Kommunikationswissenschaft, Universitat Bonn.
                      And IDS.
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      Version being used at IDS is not structured,
                      so specific texts cannot yet be accessed.
Other:

---------------------------------------------------------------------
MANNHEIM CORPORA - MK1 and MK2
---------------------------------------------------------------------
There are two Mannheim corpora, referred to as MK1 and MK2.
Compiled by:          Institut fur Deutsche Sprache (IDS)
Compiled at:          IDS
Date of compilation:
Sampling period:      1960-67 (MK1) and 1949-73 (MK2)
Language (variety):   German
Spoken/written:       Written
Size:                 2.2 million words (MK1)
                      0.3 million words (MK2)
Details of material:  MK1 contains samples from:
                        Classical literature
                        Popular literature
                        Memoirs
                        Scientific & popular scientific literature
                        Articles from newspapers & magazines
                      MK2 contains samples from:
                        Instruction manuals
                        Textbooks
                        News reports
                        Prospectuses
                        Popular literature
                        Scientific & popular scientific literature
                        Articles from newspapers & magazines
                        Legal/legislative documents
Organisation:
How transcribed:
How analysed:         Word lists and frequency lists (organised
                      alphabetically or in order of frequency) have
                      been produced. For MK1 valency data is
                      available - listing contains morphosyntactic
                      data for approximately 4000 selected verbs
                      occurring in the corpus.
Use of corpus:
Availability:         Available for purchase or rent from IDS. Word
                      lists, frequency lists, and KWIC concordances
                      also available. Details of costs on request.
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      All data stored in EBCDIC format. Magnetic
                      tape.
Other:                Programs developed to search the corpora can
                      be obtained or used on site - details from
                      IDS.

---------------------------------------------------------------------
THOMAS MANN CORPUS
---------------------------------------------------------------------
Compiled by:
Compiled at:
Date of compilation:
Language (variety):   German
Spoken/written:       Written
Size:                 c. 3.3 million words
Details of material:  The works of Thomas Mann
Organisation:         Consists of:
                        Die Buddenbrooks
                        Konigliche Hoheit
                        Lotte in Weimar
                        Der Zauberberg
                        Joseph und seine Bruder
                        Doktor Faustus
                        Der Erwahlte
                        Die Bekenntnisse des Hochstaplers Felix
                          Krull
                        Erzahlungen
                      [In 1988 4 volumes of speeches and essays
                      were added.]
How transcribed:
How analysed:
Use of corpus:
Availability:         Distributed through IDS
                        Institut fur Deutsche Sprache
                        Sitz Mannheim
                        Friedrich-Karl-Strasse 12
                        Postfach 54 09
                        6800 Mannheim 1
Storage details:      Magnetic tape
Other:                The collection was left to the IDS by Higuchi
                      of the Kyushu University, Japan.

***************************************************************
IRLIST Digest is distributed from the University of California,
Division of Library Automation, 300 Lakeside Drive, Oakland, CA.
94612-3550.

Send subscription requests to:  LISTSERV@UCCVMA.BITNET
Send submissions to IRLIST to:  IR-L@UCCVMA.BITNET

Editorial Staff:
  Clifford Lynch   lynch@postgres.berkeley.edu
                   calur@uccmvsa.bitnet
  Mary Engle       engle@cmsa.berkeley.edu
                   meeur@uccmvsa.bitnet
  Nancy Gusack     ncgur@uccmvsa.bitnet

The IRLIST Archives will be set up for anonymous FTP, and the
address will be announced in future issues.

These files are not to be sold or used for commercial purposes.
Contact Mary Engle or Nancy Gusack for more information on IRLIST.

The opinions expressed in IRLIST do not represent those of the
editors or the University of California. Authors assume full
responsibility for the contents of their submissions to IRLIST.