Over the years XRCE has been investigating challenges related to the automation of document processes, including document understanding, data migration to XML, schema management, and process modeling. This includes research on analyzing and understanding document collections based on the documents' layout and structural organization.
These methods can be thought of as going beyond what OCR systems do at the character and word level: they reconstruct higher-level structures and extract business information.
The main difficulty lies in the enormous variety of document layouts. Exhaustively inventorying all the possibilities is a never-ending and expensive task. To tackle this problem, our approach consists of automatically detecting regularities at various levels in a document (using layout as well as content information), and then extracting the targeted information, which is either layout-oriented (document conversion) or business data-oriented (data extraction).
Practically, we articulate our research around three key points:
Rich typographical modeling: using well-known typographical objects (format, type area, grid) allows us to accurately extract document objects.
Document as processing unit: processing the document as a whole, rather than page by page, allows us to efficiently detect regularities by exploiting redundant information.
Data model for driving layout analysis: complex data calls for complex layout. Our solution consists of first modeling the data, and then letting the data model supervise the layout analysis step.
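As an illustration of the second point, the following minimal sketch shows how treating the whole document as the processing unit can expose redundant elements such as running headers and footers. It is not our actual implementation: the function names, the `(y_position, text)` page representation, the 10-unit position bucket, and the `min_ratio` threshold are all assumptions made for the example.

```python
import re
from collections import Counter


def normalize(text):
    # Replace digit runs so that "Page 3" and "Page 4" compare equal.
    return re.sub(r"\d+", "#", text.strip().lower())


def find_repeated_lines(pages, min_ratio=0.6):
    """Return (position bucket, normalized text) keys that recur across pages.

    pages: list of pages, each a list of (y_position, text) tuples.
    A key is kept if it appears on at least min_ratio of the pages.
    """
    counts = Counter()
    for page in pages:
        # Bucket positions in steps of 10 units to tolerate small shifts,
        # and count each key at most once per page.
        keys = {(int(y) // 10, normalize(text)) for y, text in page}
        counts.update(keys)
    threshold = min_ratio * len(pages)
    return {key for key, n in counts.items() if n >= threshold}
```

A line seen once on a single page carries little evidence, but the same line at the same position on most pages is very likely a header or footer. That signal is only available when the document, rather than the page, is the unit of analysis.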