A hierarchical model for clustering and categorising documents
Eric Gaussier, Cyril Goutte, Kris Popat, Francine Chen
Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02), Glasgow, March 25-27, 2002.
Lecture Notes in Computer Science 2291, pp. 229-247, Springer.
We propose a new hierarchical generative model for textual data, where words may be generated by
topic-specific distributions at any level in the hierarchy. This model is naturally well suited to clustering documents
in preset or automatically generated hierarchies, as well as categorising new documents in an existing
hierarchy. Training algorithms are derived for both cases and illustrated on real data by clustering news
stories and categorising newsgroup messages. Finally, the generated model may be used to derive a Fisher
kernel expressing similarity between documents.
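
As a rough illustration of the idea that words may be generated at any level of the hierarchy, the following minimal sketch scores a document against each leaf class by mixing the word distributions of that class and of its ancestors. It is not the paper's training algorithm or exact parameterisation. The two-class hierarchy, the vocabulary, all probability values, and the names PARENT, WORD_GIVEN_NODE, NODE_GIVEN_CLASS, CLASS_PRIOR, word_prob and categorise are assumptions made up for this example.

```python
import math

# Hypothetical toy hierarchy: a root node with two leaf classes.
PARENT = {"root": None, "sports": "root", "politics": "root"}

# Hypothetical word distributions p(word | node), one per node.
# The root holds general vocabulary; the leaves hold topic-specific vocabulary.
WORD_GIVEN_NODE = {
    "root":     {"the": 0.5, "game": 0.2,  "vote": 0.2,  "goal": 0.05, "law": 0.05},
    "sports":   {"the": 0.1, "game": 0.4,  "vote": 0.05, "goal": 0.4,  "law": 0.05},
    "politics": {"the": 0.1, "game": 0.05, "vote": 0.4,  "goal": 0.05, "law": 0.4},
}

# Hypothetical mixing weights p(node | class): the chance that a word in a
# document of a given class is emitted by the class itself or by an ancestor.
NODE_GIVEN_CLASS = {
    "sports":   {"sports": 0.6, "root": 0.4},
    "politics": {"politics": 0.6, "root": 0.4},
}

# Hypothetical class priors p(class).
CLASS_PRIOR = {"sports": 0.5, "politics": 0.5}


def ancestors(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def word_prob(word, cls):
    """p(word | class) = sum over nodes on the path to the root of
    p(node | class) * p(word | node)."""
    return sum(
        NODE_GIVEN_CLASS[cls][node] * WORD_GIVEN_NODE[node][word]
        for node in ancestors(cls)
    )


def categorise(words):
    """Return the leaf class with the highest log joint probability."""
    scores = {
        cls: math.log(CLASS_PRIOR[cls])
        + sum(math.log(word_prob(w, cls)) for w in words)
        for cls in CLASS_PRIOR
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(categorise(["the", "goal", "game"]))
    print(categorise(["the", "vote", "law"]))
```

In this sketch, common words such as "the" are explained mostly by the root, so they contribute little to telling the classes apart, while words such as "goal" or "law" are explained by a leaf. In a real setting the distributions would be estimated from data rather than written by hand.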