


Caroline Tagg (Birmingham)

cant believe you forgot my surname, Mr NAME242. Ill give u a clue, its spanish and begins with m…
Issues in anonymising a corpus of text messages


As the personal data embedded in the above text message suggests, the anonymisation of linguistic data held in language corpora is not the straightforward procedure it may at first appear. Instead, it involves carefully balancing the principle of protecting participants against the practicalities of what can feasibly be anonymised without excessively altering or distorting the data. Consequently, any attempt to anonymise data is the result of decisions made at every turn as to what anonymisation means and how this understanding can be applied to the corpus in question, with consideration of the research purpose it aims to fulfil. Furthermore, legal guidelines are neither explicit nor comprehensive, and it is generally up to language researchers to interpret the legalities and ensure that they keep within the law.

This talk focuses on the anonymisation of CorTxt, a corpus of over 11,000 text messages compiled in the UK between March 2004 and May 2007. In the talk, I explore whether and why CorTxt should be anonymised, from whom participants must be protected, what needs to be replaced or removed in the corpus, and how both to identify these items and to anonymise them. My account includes challenges thrown up by the specific nature of text message data and should prove useful to researchers exploring textese and online communication as well as spoken data.
