Universal Segmentation of Text with the Sumo Formalism

Julien Quint
Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework for the creation of segmentation applications. It is called "universal" as the formalism itself is independent of the language of the documents to process and independent of the levels of segmentation (e.g. words, sentences, paragraphs, morphemes...) considered by the target application. This framework relies on a layered structure representing the possible segmentations of the document. This structure and the tools to manipulate it are described, followed by detailed examples highlighting some features of Sumo.
Proceedings of NLP 2000, pages 16-26, Patras, Greece, June, 2000.