Abstract

Adriano Ferraresi (University of Naples “Federico II”, University of Bologna)

Identifying collocations in specialised corpora: a preliminary investigation combining frequency and semantic information

The identification of collocations plays a crucial role in various fields, from discourse analysis to lexicography. However, current approaches to the definition and identification of collocations do not always prove satisfactory. Statistical approaches relying on frequency data tend to produce noisy results, in which intuitively relevant collocations are mixed with less interesting combinations (Evert & Krenn 2001). At the same time, approaches based on fine-grained semantic criteria, be they manual or computational, have proved unsuccessful (Sag et al. 2002; Nesselhauf 2004). This pilot study carries out a preliminary investigation into the possibility of implementing a method that combines the two approaches. The method will be applied to the task of identifying phraseology in a specific variety of English, namely institutional academic English published on university websites.

Focusing on adjective-noun pairs extracted from a specialised corpus using statistical methods, this paper will attempt to verify empirically the non-substitutability hypothesis (Manning & Schütze 1999), which states that, in a collocation, there is limited possibility of replacing one of the constituents while preserving the idiomaticity of the sequence. If this is the case, we should be able to infer the (lack of) collocationality of specific sequences from the number of semantically related words that co-occur with a given node word. Information about lexical substitutability (e.g. synonymy) will be gathered from existing ontologies such as WordNet, and will be applied to the task of classifying the sequences extracted from the corpus.
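
The substitutability check described above can be illustrated with a minimal sketch. It is not the procedure used in the study: the pair frequencies, the synonym table (a toy stand-in for a resource such as WordNet), the function names and the `max_substitutes` threshold are all hypothetical and chosen only for illustration. The idea is to count how many synonyms of the adjective are also attested with the same noun, and to treat pairs with few or no attested substitutes as candidate collocations.

```python
from collections import Counter

# Hypothetical adjective-noun pair frequencies, as they might be
# extracted from a specialised corpus (invented figures).
PAIR_COUNTS = Counter({
    ("academic", "year"): 120,
    ("academic", "work"): 40,
    ("scholarly", "work"): 12,
    ("scholastic", "work"): 3,
})

# Hypothetical synonym table standing in for a lexical resource
# such as WordNet.
SYNONYMS = {"academic": {"scholarly", "scholastic"}}


def attested_substitutes(pair, pair_counts, synonyms):
    """Return the synonyms of the adjective that also occur with the noun."""
    adjective, noun = pair
    candidates = synonyms.get(adjective, set())
    # A Counter returns 0 for unseen pairs, so missing keys are safe here.
    return sorted(alt for alt in candidates if pair_counts[(alt, noun)] > 0)


def is_candidate_collocation(pair, pair_counts, synonyms, max_substitutes=0):
    """Flag a pair whose adjective has at most `max_substitutes` attested
    synonym replacements (assumed threshold, not taken from the study)."""
    substitutes = attested_substitutes(pair, pair_counts, synonyms)
    return len(substitutes) <= max_substitutes


for pair in [("academic", "year"), ("academic", "work")]:
    print(pair,
          attested_substitutes(pair, PAIR_COUNTS, SYNONYMS),
          is_candidate_collocation(pair, PAIR_COUNTS, SYNONYMS))
```

With these invented figures, "academic year" has no attested synonym substitutes and is flagged as a candidate collocation, while "academic work" alternates with "scholarly work" and "scholastic work" and is not. In practice the threshold, and whether raw counts or association scores are compared, would need to be settled empirically.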

This preliminary investigation hopes to shed light on some of the complex theoretical issues concerning the identification of collocations, and to lay the basis for the future development of refined methods for identifying phraseology in specialised corpora.

References
Evert, S. & Krenn, B. (2001) Methods for the qualitative evaluation of lexical association measures. In Proc. of the 39th Annual Meeting of the ACL. Toulouse. 188–195.
Manning, C. & Schütze, H. (1999) Foundations of statistical natural language processing. Cambridge: MIT Press.
Nesselhauf, N. (2004) Collocations in a learner corpus. Amsterdam: Benjamins.
Sag, I., Baldwin, T., Bond, F., Copestake, A. & Flickinger, D. (2002) Multiword Expressions: A Pain in the Neck for NLP. In Proc. of CICLING 2002. Mexico City. 1–15.