Semi-supervised Categorization of Wikipedia collection by Label Expansion

Boris Chidlovskii
We address the problem of categorizing a large set of linked documents in which both content and structure matter, for example the Wikipedia collection proposed at the INEX XML Mining track. We consider the case where there is a small number of labeled pages and a very large number of unlabeled ones. Because the link-based structure of Wikipedia is sparse, we apply spectral and graph-based techniques developed in semi-supervised machine learning. We use the content and structure views of the Wikipedia collection to build a transductive categorizer for the unlabeled pages. We report evaluation results obtained with the label propagation function, which scales well on sparse graphs.
Will appear in the Proceedings of INEX 2008, Dagstuhl, Germany
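
For readers unfamiliar with label propagation, the idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the method evaluated in the paper: it assumes a generic iterative scheme in which labeled pages keep their class and each unlabeled page repeatedly takes the average class scores of its linked neighbors. The function name `propagate_labels`, the adjacency-list representation, and the toy six-page chain are all hypothetical and chosen only for illustration.

```python
def propagate_labels(neighbors, seed_labels, num_classes, iterations=50):
    """Generic label propagation on a sparse graph (illustrative sketch only).

    neighbors:   dict mapping each page to the list of pages it is linked with
    seed_labels: dict mapping the few labeled pages to a class index
    Returns a dict mapping every page to a predicted class index, or None
    when no label information reached that page.
    """
    # Start with one-hot scores for labeled pages and zeros elsewhere.
    scores = {}
    for page in neighbors:
        vec = [0.0] * num_classes
        if page in seed_labels:
            vec[seed_labels[page]] = 1.0
        scores[page] = vec

    for _ in range(iterations):
        updated = {}
        for page, nbrs in neighbors.items():
            if page in seed_labels or not nbrs:
                # Labeled pages are clamped; isolated pages cannot change.
                updated[page] = scores[page]
            else:
                # An unlabeled page takes the mean score of its neighbors.
                updated[page] = [
                    sum(scores[n][c] for n in nbrs) / len(nbrs)
                    for c in range(num_classes)
                ]
        scores = updated

    return {
        page: (vec.index(max(vec)) if max(vec) > 0 else None)
        for page, vec in scores.items()
    }


# Toy example (hypothetical data): a chain of six pages, labeled only at the ends.
pages = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
predicted = propagate_labels(pages, {"a": 0, "f": 1}, num_classes=2)
```

Each iteration only touches existing links, so the cost per pass grows with the number of edges rather than with the square of the number of pages, which is why this family of methods suits sparse link graphs.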

