Full and Mini-Batch Clustering of News Articles with Star-EM
Matthias Gallé, Jean-Michel Renders
We present a new threshold-based clustering algorithm for
news articles. The algorithm consists of two phases: in the first, a local
optimum of a score function that captures the quality of a clustering is
found with an Expectation-Maximization approach. In the second phase,
the algorithm reduces the number of clusters and, in particular, is able
to build non-spherical clusters. We also give a mini-batch version
that allows efficient dynamic processing of data points as they arrive
in groups. Our experiments on the TDT5 benchmark collection show that
both versions of this algorithm outperform state-of-the-art alternatives.
34th European Conference on Information Retrieval, Barcelona, Spain, 1-5 April 2012.