Corpora – Susan Nacey

British National Corpus (BYU-BNC)
British National Corpus 2014. A resource for research and teaching on the contemporary English language.
Clarino Repository Home. A Norwegian infrastructure project to make existing and future language resources easily accessible for researchers and to bring eScience to humanities disciplines.
COREFL and CEDEL2: Corpora of L2 Spanish and L2 English. Also with the option to be an informant!
Corpus of Political Speeches. An online archive of speeches from politicians around the world. This corpus has a web-based concordance feature, which allows corpus searches in untagged texts.
Dialogue corpora: Coconut corpus, Dialog diversity ‘corpus’, Speech act annotated dialogues corpus, SRI American Express travel agent dialogue corpus, Switchboard corpus, TRAINS spoken dialogue corpus.
EF – Cambridge Open Language Database: Currently contains over 83 million words from 1 million assignments written by 174,000 learners, across a wide range of levels (CEFR stages A1-C2). This text corpus includes information on learner errors, part of speech, and grammatical relationships.
EuroCoat: The European Corpus of Academic Talk. 27 Spanish undergraduate students from different universities and academic disciplines were video-recorded in conversation with their lecturers. The resulting 5 hours and 47 minutes of conversation were subsequently transcribed and form what is, to the best of our knowledge, the first corpus of office hours consultations carried out in English as an academic lingua franca.
Growth in Grammar. A three-year project studying how English children’s written language develops as they progress through their school careers. The project drew on 2,898 texts from 983 children in 24 schools and used a number of computer-assisted methods to understand differences in the use of grammar and vocabulary across year groups and text types.
Korean English Learners’ Spoken Corpus (KELSC). All CEFR proficiency levels, 36,000+ words.
LOCO: the 88-million-word Language of Conspiracy corpus. Read about the corpus in this Open Access article.
MuSSeL. The Multilingual Corpus of Second Language Speech (MuSSeL) is being developed by researchers at the University of Utah’s Second Language Teaching & Research Center. It provides researchers and teachers with an unprecedentedly large and varied set of transcribed and tagged L2 speech samples as well as access to the original MP3 recordings.
Norwegian-English Student Translation Corpus (NEST): Translations from Norwegian into English, produced by students of English at Norwegian universities and colleges.
Pelcra: Tools and resources for Polish and English corpora.