Learning the distance between document pages
Semisupervised clustering algorithms enhance traditional clustering algorithms by using additional information about which points belong to the same cluster and which do not. We take this idea to the extreme by providing a complete reference clustering against which generated clusterings are evaluated. We use this evaluation to learn the distance between points. To this end, we decompose the distance into feature distances and learn the weights of the individual features. We propose different methods to learn these weights and apply our technique in the domain of document page clustering. Our goal is to identify groups of pages that can subsequently be treated in the same manner. We conduct experiments using different document collections and obtain good results, outperforming methods from the literature.
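The decomposition of the distance into weighted feature distances can be illustrated with a minimal sketch. It assumes the combination is a weighted sum, which the abstract does not state explicitly. The page representation, the two feature distances, and all names below are hypothetical and chosen only for illustration; the weight-learning step itself is not shown.

```python
from typing import Callable, Dict, Sequence

# Hypothetical page representation: a mapping from feature name to value.
Page = Dict[str, object]
FeatureDistance = Callable[[Page, Page], float]


def combined_distance(
    a: Page,
    b: Page,
    feature_distances: Sequence[FeatureDistance],
    weights: Sequence[float],
) -> float:
    """Weighted sum of per-feature distances (assumed form of the decomposition)."""
    if len(feature_distances) != len(weights):
        raise ValueError("need exactly one weight per feature distance")
    return sum(w * d(a, b) for w, d in zip(weights, feature_distances))


# Two illustrative feature distances (assumptions, not taken from the paper).
def word_count_distance(a: Page, b: Page) -> float:
    # Relative difference in word count, in [0, 1].
    return abs(a["words"] - b["words"]) / max(a["words"], b["words"], 1)


def has_table_distance(a: Page, b: Page) -> float:
    # 0 if both pages agree on containing a table, 1 otherwise.
    return 0.0 if a["has_table"] == b["has_table"] else 1.0


page_a = {"words": 400, "has_table": True}
page_b = {"words": 300, "has_table": False}
features = [word_count_distance, has_table_distance]

# The weights would be learned from the reference clustering; fixed here.
dist = combined_distance(page_a, page_b, features, [0.5, 0.5])
print(dist)
```

In this sketch, learning the distance amounts to searching for the weight vector whose induced clustering agrees best with the reference clustering.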
EWMF Workshop on datamining for business, Porto, Portugal, 3rd October, 2005.