IWRAP: Intelligent Wrapper Learning Tools
The Intelligent Wrapper Learning Tools (IWrap) automatically generate HTML wrappers for accessing heterogeneous information sources. The learning program mainly handles automatically generated HTML pages: pages that a Web source (such as Altavista) produces in answer to user queries, or pages whose content is updated regularly (such as CNN news pages). The IWrap tools remove the tedious and error-prone process of manually preparing rules for extracting data from heterogeneous Web information sources.
During the learning phase, the IWrap program analyses sample HTML pages, learns how to associate user attributes with HTML elements, and generates the extraction rules. User assistance is minimal: the user never needs to inspect the HTML files or their tags, and only has to associate the HTML fragments that IWrap has learned and extracted with attribute names. After collecting sufficient statistical data, the program moves on to generating extraction rules automatically.
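The learning phase described above can be illustrated with a minimal sketch. It is a deliberately simplified stand-in, not the grammar learning mechanism IWrap actually uses: it assumes that the last k tags preceding a text fragment are enough to identify its attribute, and all names (`tokenize`, `learn`, `make_rules`, `min_count`, the sample pages) are hypothetical.

```python
import re
from collections import Counter, defaultdict

# A token is either a whole tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]

def tag_name(tok):
    """Reduce '<a href="x">' to '<a>' so that attributes do not matter."""
    m = re.match(r"<\s*(/?)\s*([A-Za-z0-9]+)", tok)
    return "<%s%s>" % (m.group(1), m.group(2).lower()) if m else tok

def learn(samples, k=2):
    """Count, for each context of k preceding tags, the attribute the user assigned."""
    stats = defaultdict(Counter)
    for html, labels in samples:
        context = []
        for tok in tokenize(html):
            if tok.startswith("<"):
                context.append(tag_name(tok))
            else:
                # Unlabelled text is counted under the pseudo-attribute "none".
                stats[tuple(context[-k:])][labels.get(tok, "none")] += 1
    return stats

def make_rules(stats, min_count=2):
    """Emit a rule for a context once enough samples support it (assumed threshold)."""
    rules = {}
    for context, counts in stats.items():
        attribute, hits = counts.most_common(1)[0]
        if hits >= min_count:
            rules[context] = attribute
    return rules

# Two sample pages; the user only maps extracted fragments to attribute names.
samples = [
    ('<li><a href="u1">Title one</a><i>Author A</i></li>',
     {"Title one": "title", "Author A": "author"}),
    ('<li><a href="u2">Title two</a><i>Author B</i></li>',
     {"Title two": "title", "Author B": "author"}),
]
rules = make_rules(learn(samples))
print(rules)
```

In this sketch the user never looks at the markup: the labels are keyed by the extracted text fragments, and the tag contexts are collected by the program.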
When learning is over, IWrap can start applying the generated extraction rules to input HTML pages. If the program has difficulty recognizing the page structure and extracting data, it tries to guess how to tune the extraction rules and extend them to the newly encountered situation. This feature is particularly valuable when the markup of HTML pages changes over time.
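One way to picture this tuning step is a back-off strategy, sketched below under stated assumptions. The source does not say how IWrap guesses; here a rule is assumed to be a tuple of preceding tag names mapped to an attribute, and when the exact context is unknown the sketch falls back to the longest context suffix shared with a known rule and records the new context as an extension. The names `lookup`, `extract`, and the rule set are hypothetical.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")

def tokens(html):
    """Yield tag tokens and non-empty text tokens in document order."""
    for tok in TOKEN.findall(html):
        tok = tok.strip()
        if tok:
            yield tok

def tag_name(tok):
    """Reduce '<a href="x">' to '<a>'."""
    m = re.match(r"<\s*(/?)\s*([A-Za-z0-9]+)", tok)
    return "<%s%s>" % (m.group(1), m.group(2).lower()) if m else tok

def lookup(rules, context):
    """Return the attribute for the longest context suffix that matches
    exactly one attribute among the known rules, or None."""
    for n in range(len(context), 0, -1):
        suffix = context[-n:]
        matches = {attr for key, attr in rules.items() if key[-n:] == suffix}
        if len(matches) == 1:
            return matches.pop()
    return None

def extract(html, rules, k=2):
    """Apply the rules; extend a copy of them when a guess was needed."""
    rules = dict(rules)
    records, context = [], []
    for tok in tokens(html):
        if tok.startswith("<"):
            context.append(tag_name(tok))
            continue
        ctx = tuple(context[-k:])
        attr = lookup(rules, ctx)
        if attr is not None:
            rules.setdefault(ctx, attr)  # remember the newly encountered context
            records.append((attr, tok))
    return records, rules

learned = {("<li>", "<a>"): "title", ("</a>", "<i>"): "author"}
# The markup has changed: the author is now wrapped in an extra <b> element.
page = '<li><a href="u3">Title three</a><b><i>Author C</i></b></li>'
records, extended = extract(page, learned)
print(records)
```

The author is still recovered because the context `('<b>', '<i>')` shares the suffix `('<i>',)` with a learned rule, and the extended rule set then contains the new context directly.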
The IWrap program is based on the regular grammar learning mechanism developed at XRCE. It performs equally well when extracting simple and complex multi-valued attributes from HTML pages. The IWrap tools are used to generate wrappers in the askOnce commercial product and in various lab prototypes at XRCE.
Unlike existing wrapper induction techniques, the IWrap program makes no assumptions about the structure of the HTML files and always generates a correct grammar with extraction rules. The quality of these rules, in turn, is controlled by the confidence level set by the wrapper designer or user.
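How a confidence level might gate rule quality can be sketched as follows. This is an illustration only: the source does not define how IWrap computes confidence, so the sketch assumes it is the fraction of observations in a context that agree on one attribute, and `select_rules`, `confidence`, `min_support`, and the counts are hypothetical.

```python
from collections import Counter

def select_rules(stats, confidence=0.9, min_support=3):
    """Keep a rule only if its context was seen often enough and the
    majority attribute reaches the requested confidence level."""
    rules = {}
    for context, counts in stats.items():
        total = sum(counts.values())
        attribute, hits = counts.most_common(1)[0]
        if total >= min_support and hits / total >= confidence:
            rules[context] = attribute
    return rules

# Hypothetical observation counts per tag context.
stats = {
    ("<li>", "<a>"): Counter(title=10),       # unambiguous context
    ("<td>",): Counter(price=6, none=4),      # ambiguous context
}
strict = select_rules(stats, confidence=0.9)
relaxed = select_rules(stats, confidence=0.5)
print(strict)
print(relaxed)
```

A higher confidence level yields fewer but more reliable rules; a lower one covers more contexts at the cost of more extraction errors.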
For further information about the project, please contact Boris Chidlovskii.
More like this
- 2000/202 - Wrapper Generation via Grammar Induction
- 2001/012 - Wrapping Web Information Providers by Transducer Induction
- 2000/204 - Automatic Wrapper Generation for Web Search Engines
- 2000/205 - Wrapper Generation by k-Reversible Grammar Induction
- 2003/035 - Information extraction from tree documents by learning subtree delimiters