Nicola Cancedda, Eric Gaussier, Cyril Goutte, Jean-Michel Renders
We address the problem of categorising documents using kernel-based methods such as support vector
machines. Since the work of Joachims (1998), there has been ample experimental evidence that SVMs using
standard word frequencies as features yield state-of-the-art performance on a number of benchmark problems.
Recently, Lodhi et al. (2002) proposed the use of string kernels, a novel way of computing document similarity
based on matching non-consecutive subsequences of characters. In this article, we propose the use of this
technique with sequences of words rather than characters. This approach has several advantages: in particular,
it is computationally more efficient and it ties in closely with standard linguistic pre-processing techniques.
We present additional extensions to sequence kernels dealing with symbol-dependent and match-dependent
decay factors, soft-matching of symbols, and the implementation of sequence kernels for cross-lingual
The Journal of Machine Learning Research
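To make the idea in the abstract concrete, here is a minimal sketch of a gap-weighted subsequence kernel applied to word tokens rather than characters. It assumes the uniform-decay dynamic-programming recursion of Lodhi et al. (2002) and is not the authors' implementation. The names `sequence_kernel`, `normalized_kernel`, `n` (subsequence length) and `lam` (decay factor) are illustrative choices, and the symbol-dependent decay and soft-matching extensions mentioned in the abstract are not covered.

```python
import math


def sequence_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel of order n between two sequences.

    s and t can be lists of words (word-sequence kernel) or strings
    (character-level string kernel). Each shared subsequence of length n
    is weighted by lam raised to the total span it covers in s and in t.
    """
    ls, lt = len(s), len(t)
    if n < 1 or min(ls, lt) < n:
        return 0.0

    # kp[i][a][b] holds the auxiliary value K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0  # K'_0 is 1 for every pair of prefixes

    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b])
            for b in range(i, lt + 1):
                match = s[a - 1] == t[b - 1]
                kpp = lam * kpp + (lam * lam * kp[i - 1][a - 1][b - 1] if match else 0.0)
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp

    # Sum over every pair of positions where the last symbols match.
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k


def normalized_kernel(s, t, n, lam):
    """Cosine-style normalisation so that a sequence compared with itself scores 1."""
    denom = math.sqrt(sequence_kernel(s, s, n, lam) * sequence_kernel(t, t, n, lam))
    return sequence_kernel(s, t, n, lam) / denom if denom > 0 else 0.0


# Word-level usage: tokenise on whitespace, then compare token sequences.
doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the mat".split()
similarity = normalized_kernel(doc_a, doc_b, n=2, lam=0.5)
```

Working on word tokens keeps the sequences much shorter than their character-level counterparts, which is one way to read the efficiency advantage claimed above: the cost of this recursion grows with the product of the two sequence lengths.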