Crawling for domain-specific hidden web resources

Andre Bergholz, Boris Chidlovskii
The Hidden Web, the part of the Web that remains unavailable to standard crawlers, has become an important research topic in recent years. Its size is estimated to be about 400 to 500 times larger than that of the Publicly Indexable Web (PIW). Furthermore, the information on the Hidden Web is assumed to be more structured, because it is usually stored in databases. In this paper we describe a crawler that, starting from the PIW, finds entry points into the Hidden Web. The crawler is domain-specific because it can be initialized with keywords. We describe a way to identify information providers among the Hidden Web entry points. We conduct experiments using the top-level categories of the Google directory and discover various properties of the Hidden Web. First, entry points to the Hidden Web can usually be found within three or four steps of crawling on a site. Second, the number of Hidden Web sites is highly domain-specific. Third, the information providers we find are also highly domain-specific: there is hardly any overlap among information providers from different domains.
International Conference on Web Information Systems Engineering (WISE) in Rome, Italy, December 10-12.