System for converting PDF documents into structured XML format
Hervé Dejean, Jean-Luc Meunier
In this paper we present a system for converting legacy PDF documents into a structured XML format. The conversion system first extracts the different streams contained in the PDF files (text, bitmap images, and vector images) and then applies a series of components to structure them logically. Some of these components are traditional in document analysis; others are more specific to PDF. We also present a graphical user interface for checking, correcting, and validating the output of the components. The final XML schema corresponds to a generic representation of documents. After presenting the different components and the general architecture, we report on real use cases.
7th IAPR Workshop on Document Analysis Systems, Nelson, New Zealand, 13-15 February 2006.