Adam Kilgarriff

Corpora for the coming decade


How should we design and gather a corpus that will meet the needs of linguists and others over the next ten years? The most-cited model for corpus design over the last two decades has been the British National Corpus. While its design was clearly excellent for its time and has served very well, it is now approaching twenty years old. It comes from the pre-web world, when electronic text was in limited supply (and for many text types not available at all). We need new models for a world where electronic text is available in vast quantities for most text types, so that corpora can be very large and very cheap to prepare. I will talk about two current projects, both for English: one (Big Web Corpus, or BiWeC) concentrating on size, and the other (the New Model Corpus) concentrating on corpus structure, markup, and a collaborative model. Our hope is that the two strands will converge, giving a very large corpus that has many useful, large and well-specified subcorpora, is richly marked up, and supports a wide range of research questions across the linguistics and language-technology worlds. The talk will include a demo of the Sketch Engine (a corpus query tool capable of handling multi-billion-word, richly marked-up corpora) and some comments on the relation between what we do in corpus linguistics and what Google and other commercial search engines do.
