About tables of contents and how to recognize them

Jean-Luc Meunier, Hervé Dejean
We present a method for structuring a document according to the information present in its organizational tables: table of contents, table of figures, etc. The method takes a two-step approach that leverages both functional and formal (layout-based) knowledge. The functional definition of an organizational table, based on five properties, provides a first solution, which a second step improves by automatically learning the form of the table of contents. The method also determines the parts of the document body that the table refers to, which is what allows the document to be structured according to its organizational tables. We report on the robustness and performance of the method and illustrate its use in a real conversion case. Finally, we compare it with related work.
To appear in the International Journal on Document Analysis and Recognition (IJDAR).
Full paper available on the Springer website.