A Formalism for Universal Segmentation of Text
Sumo is a formalism for universal segmentation of text. Its purpose is to provide a framework
for the creation of segmentation applications. It is called "universal" because the formalism itself is
independent of the language of the documents to be processed and of the levels of segmentation
(e.g. words, sentences, paragraphs, morphemes) considered by the target application.
This framework relies on a layered structure representing the possible segmentations of
the document. This structure and the tools to manipulate it are described, followed by detailed
examples highlighting some features of Sumo.