Over the years XRCE has been investigating challenges related to the automation of document processes, including document understanding, data migration to XML, schema management, and process modeling. This includes research on analyzing and understanding document collections based on a document's layout and structural organization.
These methods go beyond what OCR systems do at the character and word level: they reconstruct higher-level structures and extract business information.
The main difficulty lies in the enormous variety of document layouts. Compiling an exhaustive inventory of all the possibilities is a never-ending and expensive task. To tackle this problem, our approach automatically detects regularities at various levels in a document (using layout as well as content information) and then extracts the targeted information, whether layout-oriented (document conversion) or business data-oriented (data extraction).
Practically, we articulate our research around three key points:
Rich typographical modeling: using well-known typographical objects (format, type area, grid) allows us to accurately extract document objects.
Document as a processing unit: this allows us to efficiently detect regularities by considering redundant information.
Data model for driving layout analysis: complex data calls for complex layouts. Our solution consists of first modeling the data, then letting the data model supervise the layout analysis step.
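To make the second point concrete, here is a toy sketch of what treating the whole document as the processing unit can buy: lines that recur on most pages (running headers, page-number footers) only stand out when all pages are examined together. This is an illustration under simplifying assumptions, not the method used at XRCE. The function names `normalize` and `find_recurring_lines`, the digit-masking rule, and the `min_ratio` threshold are all hypothetical choices made for the example.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Mask digits and collapse whitespace so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def find_recurring_lines(pages, min_ratio=0.6):
    """Return normalized lines that occur on at least min_ratio of the pages.

    `pages` is a list of pages, each page being a list of text lines.
    The threshold is an arbitrary assumption chosen for illustration.
    """
    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in page if line.strip()})
    threshold = min_ratio * len(pages)
    return {text for text, n in counts.items() if n >= threshold}


# Hypothetical three-page document with a running header and a page footer.
pages = [
    ["Annual Report 2015", "Introduction", "Page 1"],
    ["Annual Report 2015", "Revenue grew in all regions.", "Page 2"],
    ["Annual Report 2015", "Outlook", "Page 3"],
]
recurring = find_recurring_lines(pages)
```

On a single page, the header and footer are indistinguishable from body text; the redundancy across pages is what reveals them. A realistic system would also use layout cues such as position and font, which this sketch ignores.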
- READ: Recognition and Enrichment of Archival Documents (H2020 e-infrastructure)