Publications
Authors:
  • Andre Bergholz, Boris Chidlovskii
Citation:
International Conference on Web Information Systems Engineering (WISE) in Rome, Italy, December 10-12, 2003.
Abstract:
The Hidden Web, the part of the Web that remains unavailable to standard crawlers, has become an important
research topic during recent years. It is estimated that its size is about 400 to 500 times larger than that of
the Publicly Indexable Web (PIW). Furthermore, the information on the Hidden Web is assumed to be more
structured, because it is usually stored in databases. In this paper we describe a crawler that, starting from
the PIW, finds entry points into the Hidden Web. The crawler is domain-specific, because it can be initialized
with keywords. We describe a way to identify information providers among the Hidden Web entry points. We
conduct experiments using the top-level categories of the Google Directory and discover various properties of the
Hidden Web. First, entry points to the Hidden Web can usually be found within three or four steps of crawling
on a site. Second, the number of Hidden Web sites is highly domain-specific. Third, the information providers
we find are also highly domain-specific: there is hardly any overlap among information providers from different
domains.
Year:
2003
Report number:
2003/036