Publications
Authors:
  • Boris Chidlovskii
Citation:
ECAI'00 Machine Learning for Information Extraction Workshop, Berlin, August 2000
Abstract:
Modern agent and mediator systems communicate with a multitude of Web information providers to better satisfy
user requests. They use wrappers to extract relevant information from HTML pages and annotate it with
user-defined labels. A number of approaches exploit the regularity in page structures to induce instances of
wrapper classes. The power of a class is crucial: a more powerful class can successfully wrap more
sites. In this work, we use grammatical inference theory to develop a powerful wrapper class based on
k-reversible grammars. We also address the sample labeling problem and show how label conflicts can
make wrapper inference impossible. We propose a label normalization method to remove
label conflicts and induce partial wrappers.
Year:
2000
Report number:
2000/205