WID - Web Information Discovery
The problem we address
Searching for relevant information on the Web is an important and time-consuming activity. The obvious way to search for information is to use one of the popular general-purpose search engines. These search engines continuously crawl and index billions of Web pages. However, part of the Web is unavailable for central indexing. This part, often referred to as the Hidden Web or the Invisible Web, includes the content of databases and document collections accessible through (and hidden by) search interfaces offered by various Web sites. We refer to such search interfaces, which allow users to find and access the internal information of a site, as "gateways to the Hidden Web". The Hidden Web spans company sites, libraries, patent databases, university sites, media sites, etc. The goal of the WID project is to make the Hidden Web more visible, and to allow users to find and explore information on the Hidden Web through the same or similar search interfaces they use for the visible Web.
Project objectives and impact
The size of the Hidden Web is estimated to be about 500 times larger than that of the visible Web. It is further believed that the quality of the information on the Hidden Web is higher, because this information is usually organized in structured databases and with professional usage in mind. Thus, collecting, accessing, and organizing Hidden Web resources has emerged as an interesting challenge for both research and industry.
The main objective of this project is to automate the Hidden Web discovery process. We divide the project into three subprojects: the discovery, the analysis, and the classification of resources. With this separation we try to imitate the human approach to information gathering.
- Discovery: To discover Hidden Web resources, we use conventional crawlers to detect gateways. We collect pages with forms, filter out unrelated forms, and automatically detect the attribute values that are meaningful for automatic probing of the gateways' query interfaces.
- Analysis: We try to automatically understand which types of queries an information source can handle. In particular, we detect whether a resource understands Boolean operators, whether it provides stemming, whether it handles phrases, whether it is case-sensitive, etc.
- Classification: Through the automatic analysis of information returned from queries, we attempt to automatically classify a resource according to a defined ontology.
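
The discovery and analysis steps above can be illustrated with a minimal sketch. It is not the WID implementation: the filtering rule (keep forms with at least one free-text field and no password field), the names FormCollector, find_candidate_gateways and probe_case_sensitivity, and the count_hits callable are all assumptions made for illustration. The first part collects the forms on a page and filters out those that are unlikely to be search interfaces. The second part probes one query-language feature, case sensitivity, by comparing hit counts for two spellings of the same term.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect every <form> on a page together with its input fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # finished forms, in document order
        self._current = None  # form being parsed, or None outside a form

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "",
                             "text_inputs": [],
                             "has_password": False}
        elif tag == "input" and self._current is not None:
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._current["text_inputs"].append(attrs.get("name") or "")
            elif kind == "password":
                self._current["has_password"] = True

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_candidate_gateways(html):
    """Return forms that look like search interfaces.

    Assumed heuristic: at least one free-text field and no password
    field (which would suggest a login form rather than a gateway).
    """
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return [form for form in parser.forms
            if form["text_inputs"] and not form["has_password"]]


def probe_case_sensitivity(count_hits, term):
    """Guess whether a resource is case-sensitive.

    count_hits is a hypothetical callable mapping a query string to the
    number of results the resource reports. Different counts for the
    lower-case and upper-case spellings suggest case sensitivity.
    """
    return count_hits(term.lower()) != count_hits(term.upper())
```

In practice, count_hits would submit the query through a discovered form and extract the result count from the returned page. A single probe term can be misleading, so several terms would normally be tried before drawing a conclusion.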
For further information about the project, please contact Boris Chidlovskii.