What is a Word, What is a Sentence? Problems of Tokenization.
Any linguistic treatment of freely occurring text must provide an answer to what is considered
as a token. In artificial languages, what is considered as a token can be precisely
and unambiguously defined. Natural languages, on the other hand, display such a rich
variety that there are many ways to decide upon what will be considered as a unit for a computational
approach to text. Here we will discuss tokenization as a problem for computational lexicography.
Our discussion will cover the aspects of what is usually considered preprocessing of text in order to
prepare it for some automated treatment. We present the roles of tokenization, methods
of tokenization, and grammars for recognizing acronyms, abbreviations, and regular expressions such
as numbers and dates. We present the problems encountered and discuss the effects of
seemingly innocent choices.
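
As a rough illustration of the kind of tokenization grammar mentioned above, the following Python sketch uses regular expressions to recognize dates, numbers, acronyms, and abbreviations. It is not the method of the paper: the patterns, the abbreviation list, and the function name tokenize are all assumptions made for this example.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"Dr.", "Mr.", "etc."}

# Alternatives are tried in order, so more specific patterns come first.
TOKEN_PATTERN = re.compile(r"""
    \d{1,2}/\d{1,2}/\d{2,4}     # dates such as 12/05/1994
  | \d+(?:[.,]\d+)*             # numbers such as 3.14 or 1,000
  | (?:[A-Za-z]\.){2,}          # acronyms such as U.S.A.
  | [A-Za-z]+\.?                # words, optionally with a trailing period
  | \S                          # any other single non-space character
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens using the assumed patterns above."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        tok = match.group()
        # A plain word followed by a period keeps the period only if it
        # appears in the abbreviation list; otherwise the period is split off.
        if (len(tok) > 1 and tok.endswith(".")
                and tok[:-1].isalpha() and tok not in ABBREVIATIONS):
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Dr. Smith paid 1,000 dollars on 12/05/1994 in the U.S.A. today."))
```

Even this small sketch embodies choices that may look innocent: whether "1,000" is one token or three, whether a period belongs to an abbreviation, and what happens when an acronym such as "U.S.A." ends a sentence, a case the sketch does not resolve.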
The 3rd International Conference on Computational Lexicography (COMPLEX'94), pages 79-87. ISBN 963 846178 0, Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, 1994.