Schema Extraction from XML Data

Boris Chidlovskii
New XML schema languages have recently been proposed to replace Document Type Definitions (DTDs) as the schema mechanism for XML data. These languages consistently combine grammar-based constructions with constraint- and pattern-based ones and are more expressive than DTDs. Because schemas remain optional for XML data, we address the problem of extracting a schema from XML data. We model XML schemas as extended context-free grammars and propose a schema extraction algorithm based on methods of grammatical inference. The extraction algorithm also copes with the determinism requirement imposed by XML DTDs and XML Schema languages. We report the results of tests on real XML collections.
KRDB'01 Workshop (Knowledge Representation and Databases), Rome, Italy, September 15, 2001
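
To make the idea of schema extraction concrete, the following is a minimal sketch and not the algorithm of the paper. It gathers, for each element name, the sequences of child element names observed in a sample document, then generalizes them with a deliberately naive heuristic that stands in for the grammatical-inference step. The function names `collect_child_sequences` and `naive_content_model`, the occurrence-counting rule, and the treatment of childless elements as `(#PCDATA)` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_child_sequences(xml_text):
    """Map each element name to the set of child-name sequences seen for it."""
    sequences = defaultdict(set)
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # One sample string per element occurrence: its children, in order.
        sequences[element.tag].add(tuple(child.tag for child in element))
    return dict(sequences)


def naive_content_model(samples):
    """Generalize sample sequences into a DTD-like content model string.

    Hypothetical heuristic (not the paper's method): keep child names in
    first-seen order and choose an occurrence marker from the minimum and
    maximum counts observed across the samples.
    """
    order = []
    for seq in sorted(samples):  # sorted only to make the result deterministic
        for name in seq:
            if name not in order:
                order.append(name)
    if not order:
        # Assumption: an element never seen with children holds text only.
        return "(#PCDATA)"

    markers = {
        (False, False): "",   # exactly once in every sample
        (True, False): "?",   # absent in some samples, never repeated
        (False, True): "+",   # always present, sometimes repeated
        (True, True): "*",    # sometimes absent, sometimes repeated
    }
    parts = []
    for name in order:
        counts = [seq.count(name) for seq in samples]
        optional = min(counts) == 0
        repeated = max(counts) > 1
        parts.append(name + markers[(optional, repeated)])
    return "(" + ", ".join(parts) + ")"
```

Given one `book` element with children `title, author` and another with `title, author, author, year`, this sketch generalizes the content of `book` to `(title, author+, year?)`. The heuristic ignores alternation and ordering conflicts, which a real inference method would have to handle, and it does not check the determinism requirement mentioned in the abstract.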

