Information Retrieval List Digest 009 (February 9, 1990)
URL = http://hegel.lib.ncsu.edu/stacks/serials/irld/irld-009

IRLIST Digest   February 9, 1990   Volume VII Number 3   Issue 9

***************************************************************
Continued from Volume VII Number 3, Issue 8
***************************************************************

IV. PROJECTS: Initiatives and proposals / Bibliographies Abstracts / Miscellaneous
    D.4. Survey Corpora (continued)

***************************************************************

IV.D.4. (continued)

Fr: FAFSRV%NOBERGEN.BITNET@CUNYVM.CUNY.EDU
Re: SURVEY CORPORA

HELSINKI CORPUS - diachronic part (Under development)
---------------------------------------------------------------------
Compiled by: Matti Rissanen, Ossi Ihalainen, Merja Kytö.
Compiled at: Department of English, University of Helsinki.
Date of compilation:
Language (variety): English 850-1720 (Old, Middle, Modern English).
    British, American and Scots English.
Spoken/written: Written
Size: 1.6 million words
Details of material: Various text types (law, handbooks, science, trials,
    sermons, diaries, documents, plays, private and official
    correspondence, etc).
Organisation: Periodization:
    Old English       -850
                      850 - 950
                      950 - 1050
                      1050 - 1150
    Middle English    1150 - 1250
                      1250 - 1350
                      1350 - 1420
                      1420 - 1500
    Modern English    1500 - 1570
                      1570 - 1640
                      1640 - 1710 (1720)
How transcribed: Standard editions followed as far as possible.
How analysed: No grammatical analysis. Textual coding gives parameters
    describing texts (date, author, etc).
Use of corpus: For variational study of the development of English; for
    pilot studies and further development of the corpus.
Availability: Will be available (with possible restrictions as to certain
    topics of special interest to the compilers).
Storage details: Floppy disks and tapes; mainframe tapes.
Other: Coding is included for italics, emendations, editor's comments, the
    compilers' comments, foreign language, superscript, accents, headings,
    and runes.

---------------------------------------------------------------------
HELSINKI CORPUS - contemporary dialects
---------------------------------------------------------------------
Compiled by: Post-graduate students at the University of Helsinki.
Compiled at: University of Helsinki.
Date of compilation:
Language (variety): English, Hiberno-English regional dialects,
    conservative rural vernacular.
Spoken/written: Spoken
Size: 245,000 words
Details of material: Speakers are elderly (60+) male/female natives of
    small rural villages, sampled in the 1970s.
Organisation: Organised in terms of counties, villages and speakers.
How transcribed: Orthographically
How analysed: No grammatical analysis yet, but will be 'word-tagged'.
Use of corpus: For the study of dialectal syntax, to provide material for
    theses and dissertations, for teaching purposes.
Availability: Dependent upon individual researchers. The transcribers,
    who also made the original recordings in the field, have full
    copyright. At the moment these texts cannot be used without their
    permission.
    Contact: Ossi Ihalainen
             Department of English
             University of Helsinki
             Porthania 311, 00100 Helsinki
             Finland.
Storage details: Floppy disks (a Wordcruncher file is available).
Other: COCOA reference format. UHER 4200 report machines were used to
    make the recordings.

---------------------------------------------------------------------
INTERNATIONAL CORPUS OF ENGLISH - under development
---------------------------------------------------------------------
To be compiled by: Five national groups:
    BRITAIN: Sidney Greenbaum
        University College London
    USA: Charles F Meyer
        University of Massachusetts-Boston
    CANADA: Margery Fee
        Strathy Language Unit, Queen's Univ.
    AUSTRALIA: Mr David Blair, Peter Collins, Mrs Pam Peters.
        Macquarie University
    NIGERIA: Obafemi Kujore
        University of Ibadan
Compiled at: To be coordinated by Sidney Greenbaum
Date of compilation:
Language (variety): Varieties of English.
Spoken/written: Spoken and written texts.
Size:
Details of material:
Organisation:
How transcribed:
How analysed:
Use of corpus:
Availability:
Storage details:
Other:

---------------------------------------------------------------------
JDEST CORPUS (Jiao Tong University corpus for EST)
---------------------------------------------------------------------
Compiled by: Yang Huizhong
Compiled at: Jiao Tong University
Date of compilation:
Language (variety): English
Spoken/written: Written
Size: Approximately 1 million words
Details of material: Consists of English used in Science and Technology.
Organisation: Divided into the following subject areas:
    Computers
    Metallurgy
    Machine Building
    Physics
    Electrical Engineering
    Civil Engineering
    Chemical Engineering
    Naval Architecture
    Atomic Energy
    Aircraft Manufacturing
How transcribed: Ordinary written text
How analysed: Frequency list has been produced
Use of corpus: To meet the needs of students of English used in science
    and technology.
Availability:
Storage details:
Other:

---------------------------------------------------------------------
KOLHAPUR CORPUS OF INDIAN ENGLISH
---------------------------------------------------------------------
Compiled by: S. V. Shastri
Compiled at: Shivaji University, Kolhapur
Date of compilation: 1980 - 1986
Language (variety): Indian English
Spoken/written: Written
Size: Approximately 1 million words
Details of material: Samples of material printed and published in 1978.
Organisation: Divided into the same categories as Brown corpus:
    A  Press: reportage
    B  Press: editorial
    C  Press: reviews
    D  Religion
    E  Skills, trades & hobbies
    F  Popular lore
    G  Belles lettres
    H  Miscellaneous
    J  Learned & scientific writings
    K  General fiction
    L  Mystery & detective fiction
    M  Science fiction
    N  Adventure
    P  Romance & love story
    R  Humour
    500 texts spread over the 15 categories, with approximately 2000 words
    per text.
How transcribed: Orthographically with additional special codes to
    represent features of the original printed text.
How analysed: Some grammatical information annotated, e.g. possessive
    "'s" is distinguished from contracted form of "is" or "has"; functions
    of "to" are distinguished.
Availability: Distributed through ICAME.
Storage details: See ICAME distribution entry.

---------------------------------------------------------------------
LANCASTER-LEEDS TREEBANK
---------------------------------------------------------------------
Compiled by: G. Sampson, G.N. Leech
Compiled at: University of Leeds, University of Lancaster
Date of compilation:
Language (variety): British English
Spoken/written: Written
Size: 45,000 words
Details of material: Samples from all 15 categories of the LOB corpus.
    See entry for LOB corpus.
Organisation:
How transcribed:
How analysed: Phrase-structure analysis. Purely formal and "surfacy": the
    role of a constituent within its superordinate unit is not indicated
    unless it is implied by the formal category of the constituent, and
    there is no indication of "underlying structure".
Use of corpus: Originally intended for training a probabilistic parser.
Availability: Contact: Carol Lockhart (CCALAS Secretary)
                       Department of Linguistics & Phonetics
                       University of Leeds
                       Leeds LS2 9JT
                       England
Storage details:
Other:

---------------------------------------------------------------------
LANCASTER PARSED CORPUS
---------------------------------------------------------------------
Compiled by: Roger Garside & G.N. Leech
Compiled at: University of Lancaster
Date of compilation: 1986-89
Language (variety): British English
Spoken/written: Written
Size: c. 65,000 words
Details of material: Ten texts from each of the fifteen LOB categories
    (all the texts in the categories containing fewer than 10 texts). A
    total of 145 texts, but some sentences were rejected as being too long
    to process.
Organisation:
How transcribed:
How analysed: Automatically parsed using the UCREL parsing system, which
    uses statistics derived from the Lancaster-Leeds treebank (see entry
    for Lancaster-Leeds treebank). The parsing scheme is similar to that
    of the Lancaster-Leeds treebank, though it has been simplified by
    eliminating some subcategory symbols.
Use of corpus:
Availability: Available for limited distribution by early May 1989.
    Contact: The UCREL Secretary
             Department of Linguistics & Modern English Language
             University of Lancaster
             Lancaster LA1 4YT
             England
Storage details:
Other:

---------------------------------------------------------------------
LOB CORPUS (Lancaster-Oslo/Bergen corpus)
---------------------------------------------------------------------
Compiled by: Stig Johansson & G. N. Leech
Compiled at: University of Oslo & University of Lancaster
Date of compilation: 1970 - 1976
Language (variety): British English
Spoken/written: Written
Size: Approximately 1 million words
Sampling period: 1961
Organisation: Divided into categories:
    A  Press: reportage
    B  Press: editorial
    C  Press: reviews
    D  Religion
    E  Skills, trades & hobbies
    F  Popular Lore
    G  Belles lettres, biography, essays
    H  Miscellaneous
    J  Learned & scientific writings
    K  General fiction
    L  Mystery & detective fiction
    M  Science fiction
    N  Adventure & western fiction
    P  Romance & love story
    R  Humour
    500 texts spread over the 15 categories, with 2000 words per text.
How transcribed: Orthographically.
How analysed: Word-tagged using CLAWS1 tagging system.
    A subsection has been manually parsed - see entry for
    "Lancaster-Leeds Treebank".
    A subsection has been automatically parsed - see entry for
    "Lancaster parsed corpus".
Availability: Distributed through ICAME and the Oxford Text Archive.
Storage details: See ICAME distribution entry.
Other: Orthographic version contains coding symbols used to represent
    features of the original printed text.
    Parallel corpus to the Brown corpus.

---------------------------------------------------------------------
LONDON-LUND CORPUS OF SPOKEN ENGLISH
---------------------------------------------------------------------
Compiled by: The Survey of Spoken English (Director: Jan Svartvik), using
    spoken material from the Survey of English Usage Corpus (director:
    R. Quirk).
Compiled at: Lund University
Date of compilation: 1975 - 1981
Language (variety): British English
Spoken/written: Spoken
Size: Approximately 500,000 words
Details of material: Adult native speakers of British English. Broadcast
    and recorded material.
Organisation: Category "II" of the overall SURVEY OF ENGLISH USAGE CORPUS.
    II  Material with origin in speech (100 texts)
    Subdivided into:
    II. A  Monologue
             Prepared but unscripted oration
             Spontaneous oration
             Spontaneous commentary
        B  Dialogue
             Conversation
             Surreptitious
             Non-surreptitious
             Telephone
How transcribed: Prosodically transcribed
How analysed: Has been word-tagged using semi-automatic system developed
    at Lund. Semi-automatic partial syntactic analysis.
Use of corpus: Analysis of spoken English.
Availability: Prosodically transcribed version is distributed through
    ICAME. KWIC concordances available for text categories 1 - 12, also
    distributed through ICAME.
Other: Subgroup A in the text classification, consisting of 34
    conversation texts, has also been printed in a book, "A Corpus of
    English Conversation", edited by J. Svartvik & R. Quirk (1980), in the
    series "Lund Studies in English", CWK Gleerup Publishers.

---------------------------------------------------------------------
LONGMAN/LANCASTER ENGLISH LANGUAGE CORPUS - under development
---------------------------------------------------------------------
Compiled by: Dictionaries Division, Longman Group Ltd.
    Divisional Director: Della Summers
    Advisers: Sir Randolph Quirk, Geoffrey Leech
Compiled at: Longman Group Ltd., Longman House, Burnt Mill, Harlow,
    Essex CM20 2JE.
Date of compilation: In progress mainly since 1985
Language (variety): English. A wide range of varieties, including British
    English, American English, and other national varieties. Mainly
    standard English of the twentieth century, sampled from varied
    stylistic levels and text types.
Spoken/written: Both spoken and written data.
Sampling Period: Mainly later 20th century, but including some earlier
    material still in current use.
Size: Planned size: 30 - 50 million words.
Details of material: Includes formal and informal, technical and
    non-technical styles.
Organisation: Organization to be decided after sampling stage.
    Preliminary breakdown of categories:
    A. FIELD
       1. Informative/imaginative
       2. 'Superfields' : 1. natural & pure science, 2. applied science
       3. Major subject areas
       4. Individual subjects
    B. MEDIUM
       written (published/manuscript)
       spoken (recorded/broadcast)
    C. TIME
       pre-20th century/20th century
    D. REGION
       British/American/Australian/Caribbean/Indian
    E. TEXT LENGTH
       short/medium/long
    F. 'LEVEL'
       high (= technical, literary)/medium (= general, layperson)/
       low (= 'popular')
    Other parameters (e.g. sex of author/speaker) will be annotated, and
    taken account of in the composition of the corpus.
How transcribed: Spoken material will be transcribed orthographically.
How analysed: Not for the time being.
Use of corpus: For lexicographic and academic research
Availability: It is hoped to make the corpus available for academic
    research. These matters are at present under negotiation. Will be
    distributed by Longman.
Other: Data will contain special coding symbols; a full key will be
    provided for permitted users.

---------------------------------------------------------------------
MACQUARIE (UNIVERSITY) CORPUS - under development
---------------------------------------------------------------------
Compiled by: Pam Peters, David Blair, Peter Collins, Alison Brierley.
Compiled at: Macquarie University, University of NSW
Date of compilation:
Language (variety): Australian English
Spoken/written: Written
Sampling period: 1986 (a quarter century later than the sampling period
    for Brown/LOB).
Size: Approximately 1 million words
Details of material: Will parallel Brown and LOB corpora as closely as
    possible. Text samples will be c. 2000 words, but complete source text
    will be kept in a "monitor" corpus. Styles included will be formal,
    semi-formal, and technical.
Organisation: Will parallel Brown and LOB corpora, i.e. have 15
    categories, but with minor internal differences prompted by local
    factors.
How transcribed: Modelled on LOB/Brown.
How analysed: Will be tagged and parsed at a later stage.
Use of corpus: To facilitate inter-dialectal comparisons (with BrE-LOB
    and AmE-Brown), and explore aspects of Australian English.
Availability: School of English and Linguistics, Macquarie University
    2109 NSW, Australia.
Storage details: Stored on tape.

---------------------------------------------------------------------
MELBOURNE-SURREY CORPUS
---------------------------------------------------------------------
Compiled by: G. G. Corbett, Khurshid Ahmad
Compiled at: Department of Linguistic and International Studies, and
    Computing Unit, University of Surrey, Guildford, Surrey GU2 5XH.
Date of compilation:
Sampling period: 1980-81
Language (variety): Australian English
Spoken/written: Written
Size: c. 100,000 words
Details of material: Taken from the newspaper "The Age" published in
    Melbourne. The texts are all editorials which appeared from Sept. 1,
    1980 to Jan. 30, 1981.
Organisation: Stored in 93 separate files. Each file consists of two
    editorials selected on the same day.
How transcribed: Ordinary written text.
How analysed:
Use of corpus: Of value to those working on varieties of English, and
    should complement the work being done on spoken Australian English.
Availability: Distributed through ICAME. Available for research purposes.
Other: Material is all in uppercase, but upper/lower case information is
    available in the originals which are lodged with ICAME.

---------------------------------------------------------------------
NIJMEGEN CORPUS
---------------------------------------------------------------------
Compiled by: J Aarts and others
Compiled at: University of Nijmegen
Date of compilation:
Language (variety): British English
Spoken/written: Written
Size: 1.5 million words
Sampling period: Post 1975
Details of material: Corpus consists of material that was "written to be
    read", i.e. no samples of poetry, plays, speeches, etc., which are
    meant to be spoken. All texts in educated British English (no
    varieties of English or non-educated English allowed).
Organisation: Divided into the following categories:
    NON-FICTION
    I Arts
        NAUT  autobiography/biography
        NEDU  education
        NHIS  history
        NLIN  language and linguistics
        NLIT  literary criticism
        NPHI  philosophy
        NPSY  psychology and psychiatry
        NSOC  sociology and anthropology
        NWOM  women's studies
    II Sciences
        NBIO  biology
        NCHE  chemistry
        NECO  economics
        NGEO  geography
        NMED  health and medicine
        NPHY  physics
    III Miscellaneous
        NGEN  non-fiction, general
        NLAW  law and government
        NMYS  mysticism and the occult
        NPOL  politics
        NREL  religion and mythology
        NTRA  travel
    FICTION
        FCRI  crime and mystery
        FHOR  horror
        FHUM  humour
        FNOV  general fiction, novel
        FPSY  psychological novel
        FROM  love and romance
        FSFF  science fiction and fantasy
        FSTO  general fiction, short story
        FTHR  thriller and adventure
How transcribed: Ordinary written text, but with additional codes
    modelled on the LOB coding system, to preserve printed features.
How analysed:
Use of corpus: Study of linguistic variation
Availability:
Storage details:
Other:

***************************************************************
Continued in Volume VI, Number 5, Issue 10
***************************************************************

IRLIST Digest is distributed from the University of California, Division
of Library Automation, 300 Lakeside Drive, Oakland, CA. 94612-3550.

Send subscription requests to: LISTSERV@UCCVMA.BITNET
Send submissions to IRLIST to: IR-L@UCCVMA.BITNET

Editorial Staff:
    Clifford Lynch   lynch@postgres.berkeley.edu   calur@uccmvsa.bitnet
    Mary Engle       engle@cmsa.berkeley.edu       meeur@uccmvsa.bitnet
    Nancy Gusack     ncgur@uccmvsa.bitnet

The IRLIST Archives will be set up for anonymous FTP, and the address
will be announced in future issues.

These files are not to be sold or used for commercial purposes. Contact
Mary Engle or Nancy Gusack for more information on IRLIST.

The opinions expressed in IRLIST do not represent those of the editors or
the University of California.
Authors assume full responsibility for the contents of their submissions to IRLIST.