Supervised learning for the legacy document conversion
Boris Chidlovskii, Jérôme Fuselier
We consider the problem of converting documents from rendering-oriented HTML markup into semantic-oriented XML annotation defined by user-specific DTDs or XML Schema descriptions. We represent both source and target documents as rooted ordered trees, so that the conversion can be achieved by applying a set of tree transformations. We apply the supervised learning framework to the conversion task, in which the tree transformations are learned from a set of training examples. We develop a two-step approach to the conversion problem that first labels the leaves in the source trees and then recomposes the target trees from the leaf labels. We present two solutions based on leaf classification with the target terminals and paths. Moreover, we develop three methods for the leaf classification. All methods and solutions have been tested on two real collections.
ACM Symposium on Document Engineering, Milwaukee, Wisconsin, USA, October 28-30, 2004.
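
The two-step approach named in the abstract (label the source leaves, then recompose the target tree from the labels) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each leaf label is a slash-separated target path such as `article/authors/author`, that consecutive leaves sharing a path prefix belong under the same target node, and that a hand-written rule function (`toy_classifier`, hypothetical) stands in for a learned classifier. The function names, the tag names, and the sample leaves are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def toy_classifier(text):
    """Hypothetical stand-in for a learned leaf classifier.

    Maps the text of a source leaf to a target path. A real system
    would learn this mapping from training examples.
    """
    if text.isupper():
        return "article/title"
    if text.startswith("By "):
        return "article/authors/author"
    return "article/body/para"


def label_leaves(leaves, classify):
    """Step 1: attach a target path to every source leaf, in document order."""
    return [(text, classify(text)) for text in leaves]


def recompose(labeled_leaves):
    """Step 2: rebuild a target tree from (text, path) pairs.

    Assumption: a leaf whose path shares intermediate tags with the most
    recently created branch is appended to that branch; otherwise a new
    branch is opened.
    """
    root_tag = labeled_leaves[0][1].split("/")[0]
    root = ET.Element(root_tag)
    for text, path in labeled_leaves:
        tags = path.split("/")
        if len(tags) < 2 or tags[0] != root_tag:
            raise ValueError(f"unexpected target path: {path!r}")
        node = root
        for tag in tags[1:-1]:
            children = list(node)
            if children and children[-1].tag == tag:
                node = children[-1]              # reuse the open branch
            else:
                node = ET.SubElement(node, tag)  # open a new branch
        leaf = ET.SubElement(node, tags[-1])
        leaf.text = text
    return root


# Leaves as they might be read, left to right, from a source HTML tree.
source_leaves = [
    "LEGACY CONVERSION",
    "By A. Author",
    "By B. Author",
    "First paragraph.",
    "Second paragraph.",
]

target = recompose(label_leaves(source_leaves, toy_classifier))
print(ET.tostring(target, encoding="unicode"))
```

Labelling leaves with full paths rather than with terminal names alone lets the recomposition step recover the nesting directly from the labels, at the cost of a larger label set for the classifier.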