Information extraction from tree documents by learning subtree delimiters

Boris Chidlovskii
Information extraction from HTML pages has conventionally treated the pages as plain text documents extended with HTML tags. However, the growing maturity and correct usage of HTML/XHTML formats open an opportunity to treat Web pages as trees, to mine the rich structural context in the trees, and to learn accurate extraction rules. In this paper, we generalize the notion of delimiter developed for string information extraction to tree documents. Similar to delimiters in strings, we define delimiters in tree documents as subtrees surrounding the text leaves. We formalize wrapper induction for tree documents as learning classification rules based on the subtree delimiters. We analyze a restricted case of subtree delimiters in the form of simple paths. We design an efficient data structure for storing candidate delimiters and an incremental algorithm for finding the most discriminative subtree delimiters for the wrapper.
IJCAI-03 Workshop on Information Integration on the Web
2003
2003/035
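
The restricted case mentioned in the abstract, delimiters in the form of simple paths, can be illustrated with a small sketch. This is not the paper's algorithm or data structure. It assumes that a path delimiter is just the sequence of tag names from the root to the element holding a text leaf, and that a path is "discriminative" when every leaf it reaches carries the same label. The names leaf_paths, discriminative_paths, PAGE and LABELS are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def leaf_paths(xml_text):
    """Return (path, text) pairs for every element with non-empty text.

    Assumption: a "simple path" delimiter is the tuple of tag names from
    the root down to the element that directly contains the text leaf.
    Tail text (text following a child element) is ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    results = []

    def walk(node, prefix):
        path = prefix + (node.tag,)
        text = (node.text or "").strip()
        if text:
            results.append((path, text))
        for child in node:
            walk(child, path)

    walk(root, ())
    return results


def discriminative_paths(leaves, labels):
    """Map each path to its label when all leaves on that path agree.

    Assumption: unlabeled leaves get the label "none"; a path reaching
    leaves with mixed labels is dropped as non-discriminative.
    """
    seen = defaultdict(set)
    for path, text in leaves:
        seen[path].add(labels.get(text, "none"))
    return {p: next(iter(ls)) for p, ls in seen.items() if len(ls) == 1}


# Hypothetical well-formed XHTML fragment and hand-made labels.
PAGE = (
    "<html><body><h1>Catalog</h1><ul>"
    "<li><b>Widget</b><i>9.99</i></li>"
    "<li><b>Gadget</b><i>19.50</i></li>"
    "</ul></body></html>"
)
LABELS = {"Widget": "name", "Gadget": "name", "9.99": "price", "19.50": "price"}

leaves = leaf_paths(PAGE)
rules = discriminative_paths(leaves, LABELS)
for path, label in sorted(rules.items()):
    print("/".join(path), "->", label)
```

In this toy page the path ending in b separates names from prices, which end in i. A full subtree delimiter as described in the abstract would also take the structure surrounding the leaf into account, which this sketch does not attempt.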

Attachments

subtreeDelimiter03Letter.pdf (55.32 kB)