Recognizing Lexical Patterns in Text

Greg Grefenstette, Anne Schiller, Salah Ait-Mokhtar
For most natural language processing tasks, the complexity and richness of the lexicon determine the ultimate performance of the system. In this chapter we present a number of low-level natural language processing techniques for recognizing lexical structures in a domain-specific corpus, concentrating on techniques that precede the manual construction of a lexicon or that can serve as a basis for its automatic creation. Recognizing things in text is easier for a computer than recognizing things in images, but in both domains recognizing means abstracting away surface differences in order to identify two variants of the same object. The computational linguistics community has developed a number of techniques for abstracting away surface differences in text: tokenization, lemmatization, part-of-speech tagging, and finite-state pattern recognition. We give an overview of these techniques here.
F. Van Eynde, D. Gibbon (eds.): Lexicon Development for Speech and Language Processing. Kluwer Academic Publishers. 2000.