Document Structure Analysis: Modelling Unseeable Patterns
To simplify slightly, the layout models described in the Document Layout Analysis literature are rather basic: working at the page level, they describe page elements as a possibly recursive organization (graph, tree) of rectangular boxes. The literature of other communities working on documents (codicology, typography, book design), however, offers far richer models for describing documents.
In this presentation, we show how some layout models found in past and modern text layout practices (ruling, grid) can improve document digitization. We first present these unseeable layout concepts, used for manuscripts and printed books, and then explain why they are of prime importance in today's document digitization. In particular, we show that the traditional concept of type area is a key notion for modeling document layout. We illustrate this work with several practical uses and evaluations, from OCR improvement to high-level logical segmentation. These examples highlight the advantage of developing algorithms that operate at the document level rather than at the page level.
Workshop Machines and Manuscripts, Karlsruhe, Germany, 19-20 February, 2015.