Publications
Authors:
  • Boris Chidlovskii
Citation:
Will appear in the Proceedings of INEX 2008, Dagstuhl, Germany
Abstract:
We address the problem of categorizing a large set of linked documents with important content and structure aspects, for example, from the Wikipedia collection proposed at the INEX XML Mining track. We cope with the case where there is a small number of labeled pages and a very large number of unlabeled ones. Due to the sparsity of the link-based structure of Wikipedia, we apply the spectral and graph-based techniques developed in semi-supervised machine learning. We use the content and structure views of the Wikipedia collection to build a transductive categorizer for the unlabeled pages. We report evaluation results obtained with the label propagation function, which ensures good scalability on sparse graphs.
Year:
2008
Report number:
2008/078