A Formalism for Universal Segmentation of Text

Julien Quint
Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework for the creation of segmentation applications. It is called "universal" because the formalism itself is independent of the language of the documents to process and independent of the levels of segmentation (e.g. words, sentences, paragraphs, morphemes...) considered by the target application. This framework relies on a layered structure representing the possible segmentations of the document. This structure and the tools to manipulate it are described, followed by detailed examples highlighting some features of Sumo.
COLING 2000