How machines can learn what humans interpret: adapting probabilistic topic models to natural language.


During the past decade, we have seen an explosion in the amount of digital information. This has led to information overload, making it difficult for humans to make sense of very large document collections, such as emails, digital libraries, news articles or legal documents.

Probabilistic topic models, in part pioneered by Xerox under the trademark Smarter Document Management Technologies, are now routinely used to analyse and explore large sets of documents.[1] Topic models automatically organize documents into semantic clusters or topics, based on the statistical properties of the text they contain. Their tremendous success (more than 6,500 citations in Google Scholar at the time of writing) can be attributed to their simplicity and appealing interpretation. Yet, while successful, current topic models still do not account for fundamental properties of natural language; accounting for these properties would lead to more diverse and interpretable topics.

What are probabilistic topic models?

Topic models extract human-intelligible topics from text in an unsupervised way. This means that the clusters of documents are automatically learnt from the data without any human intervention. Probabilistic topic models posit a generative process for document collections: they propose a probabilistic model (i.e., a set of interdependent random variables) that describes how documents are generated.

To capture semantics, topic models make a key simplifying assumption: a document can be represented as a mixture of topics, ignoring the word order of the text. While rather basic, this simplifying assumption has proven to be effective in practice when one is only interested in extracting the topics. Providing computers with the capability to recognize the topics of a document enables them to identify documents discussing similar content and then suggest these findings to the human user.
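To make the word-order-free representation concrete, here is a minimal sketch in Python showing how a document is reduced to word counts (a "bag of words") before any topic modelling takes place; the two toy documents are invented for the example:

    from collections import Counter

    # Toy corpus: two short, hypothetical documents.
    documents = [
        "the school budget pays for children education and arts programs",
        "the city budget funds schools and new education initiatives",
    ]

    # Bag-of-words representation: word order is discarded,
    # only the count of each word in each document is kept.
    bags = [Counter(doc.split()) for doc in documents]

    for i, bag in enumerate(bags):
        print(f"document {i}: {dict(bag)}")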

Figure 1 is an example of four topics and a piece of text, where each word is assigned to one of the four topics (arts, budgets, children, education). Each topic is defined by a list of vocabulary words, each one being assigned a probability. A document is then assumed to be generated from these topics as follows (a small simulation sketch is given after the list):

  1. For each document in the collection:
    a. Decide the number of words in the document;
    b. Choose a weight for each topic. This weight corresponds to the prevalence of each topic in the document.
  2. Each word in the document is then generated as follows:
    a. Select a topic. The probability of a topic being selected is proportional to the weight it was assigned in step 1.b;
    b. Choose a word in the vocabulary according to the selected topic. The probability of a word being selected is proportional to a topic-specific weight.
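The following sketch simulates this generative process for a single document. It is a minimal illustration in Python: the vocabulary, the number of topics and the Dirichlet and Poisson parameters are invented for the example and are not taken from the article.

    import numpy as np

    rng = np.random.default_rng(0)

    vocabulary = ["art", "museum", "budget", "tax", "child", "school"]  # toy vocabulary
    num_topics = 4
    V = len(vocabulary)

    # Each topic is a distribution over the vocabulary words.
    topic_word = rng.dirichlet(np.full(V, 0.1), size=num_topics)

    # 1.a Decide the number of words in the document.
    num_words = rng.poisson(20)

    # 1.b Choose a weight for each topic (its prevalence in the document).
    topic_weights = rng.dirichlet(np.full(num_topics, 0.5))

    document = []
    for _ in range(num_words):
        # 2.a Select a topic with probability proportional to its weight.
        z = rng.choice(num_topics, p=topic_weights)
        # 2.b Choose a word according to the selected topic's word distribution.
        w = rng.choice(V, p=topic_word[z])
        document.append(vocabulary[w])

    print(" ".join(document))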

This probabilistic model not only proposes an appealing generative model of documents, but it also enjoys a relatively simple inference procedure (a collapsed Gibbs sampler, to be precise) based on simple word counts, which is able to handle millions of documents in a couple of minutes.[2] Inference is the process of deciding which topics should be associated with the documents. It is done automatically, based on data-driven evidence. Knowing the topic associations is useful in practice as it enables one, for example, to recover documents that share the same set of topics.
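As a rough illustration of how such count-based inference works, here is a minimal sketch of one sweep of a collapsed Gibbs sampler for a standard topic model. The tiny corpus, the hyperparameters alpha and beta, and the number of topics are placeholders chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy corpus encoded as lists of word ids (placeholder data).
    docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 2, 4, 4]]
    V, K = 5, 2             # vocabulary size, number of topics
    alpha, beta = 0.5, 0.1  # Dirichlet hyperparameters

    # Random initial topic assignment for every token, plus the count tables.
    z = [[rng.integers(K) for _ in doc] for doc in docs]
    n_dk = np.zeros((len(docs), K))  # topic counts per document
    n_kw = np.zeros((K, V))          # word counts per topic
    n_k = np.zeros(K)                # total counts per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    # One Gibbs sweep: resample each token's topic given all the others.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # Conditional probability of each topic, computed from counts only.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    print(n_kw)  # topic-word counts after one sweep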

Figure 1 (reproduced from reference 1). Topics are defined by a list of words. Each column corresponds to a topic and the words in the list are ranked according to their relevance. The words in the boxed text are coloured according to the topics shown at the top. Each word is modelled as being drawn independently from one of these topics, neglecting the sequential structure of text.

Weaknesses of standard probabilistic topic models

A practical issue with topic models is identifying the most likely number of topics describing the data, which is a computationally expensive procedure. When modelling real data, the number of topics is expected to grow logarithmically with the size of the corpus: as the number of documents in the corpus increases, it is reasonable to assume that new topics will appear, but the increase will not be linear in the number of documents; there will be a saturation effect. The issue can be dealt with in a principled way by considering nonparametric Bayesian extensions, a recent trend in probabilistic machine learning.[3]
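To illustrate the kind of sub-linear growth such nonparametric models assume, here is a small sketch simulating a Chinese Restaurant Process, a standard building block of nonparametric Bayesian clustering models; the concentration parameter alpha and the number of items are arbitrary choices for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 5.0   # concentration parameter (arbitrary for the example)
    counts = []   # number of items assigned to each cluster so far

    for n in range(1, 10001):
        # A new cluster (topic) is created with probability alpha / (alpha + n - 1).
        if rng.random() < alpha / (alpha + n - 1):
            counts.append(1)
        else:
            # Otherwise join an existing cluster with probability proportional to its size.
            probs = np.array(counts) / (n - 1)
            counts[rng.choice(len(counts), p=probs)] += 1

    # The number of clusters grows roughly like alpha * log(n), not linearly in n.
    print(len(counts), alpha * np.log(10000))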

A second weakness of topic models is their limited expressiveness. The prevalence of a topic in the corpus is correlated with its prevalence in individual documents. Similarly, the prevalence of a word occurring in the corpus is correlated with its prevalence in the individual topics. These are undesirable properties. For example, a good model should be able to identify that a word characterizes a specific topic irrespective of its frequency in the document collection.

Finally, and perhaps most importantly, the probabilistic model postulated by topic models is inappropriate for modelling real text. Data sampled from the model are statistically distributed differently from real observations. For example, it is well known that modern languages exhibit power-law properties (see Figure 2). This means that human languages have a very heavy tail: a few words are extremely frequent, while many words are very infrequent. This is not accounted for in classical topic models.

Figure 2 shows the ordered word frequencies of four benchmark corpora available from http://archive.ics.uci.edu/ml. Let f(r) be the frequency of the word of rank r in the corpus, with words ranked by decreasing frequency. It can be observed that the ranked word frequencies follow Zipf’s law, which is an example of a power-law distribution: f(r) ∝ r^(−s), where s is a positive constant. Like many natural phenomena, human languages, including English, exhibit this property. Intuitively, this means that human languages have a very heavy tail: a few words are extremely frequent, while many words are very infrequent.
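A quick way to observe this behaviour on any corpus is to rank word frequencies, plot them on a log-log scale and estimate the slope. The sketch below does the fit in Python; the token list is a tiny placeholder standing in for a full corpus.

    import numpy as np
    from collections import Counter

    # Placeholder token stream; in practice this would be a full corpus.
    tokens = "the cat sat on the mat the dog sat on the rug".split()

    # Rank words by decreasing frequency.
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)

    # Under Zipf's law, log f(r) is approximately linear in log r with slope -s.
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), deg=1)
    print("estimated Zipf exponent s:", round(-slope, 2))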

Using probabilistic topic models with power-law characteristics in an idea management system

The data we observe in practice, such as text, images or social networks, show significant departures from the standard distributions encountered in statistics. When our target application is to automatically organize a large set of documents according to topics, we should use models that are able to learn a potentially infinite number of topics and that account for the power-law characteristics of natural language. Moreover, we would like to increase the model's expressiveness, either by allowing more diverse topic distributions or by favouring more specialized topics, while preserving a simple and efficient inference procedure. This can be achieved by basing topic models on a stochastic process called the Indian Buffet Process (IBP).[4,5]

The generative model for a document resulting from the IBP-based topic model is similar to that of the standard topic model, except that a small subset of topics is selected before weights are assigned to them. Similarly, each topic is defined by a relatively small subset of the vocabulary words, and these subsets follow a power law. The IBP operates as a binary mask on the discrete distributions defining the topics and their association to documents. Topics extracted from the corpus are more specific, possibly assigning a large weight to infrequent but informative words, and they are more discriminative. We observed experimentally that fewer topics were associated with each document.
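To give an idea of the kind of binary mask the IBP produces, here is a minimal sketch of the standard "Indian buffet" construction, where documents play the role of customers and topics the role of dishes. The number of documents and the parameter alpha are arbitrary choices for the example, and the full model in [4] combines such a mask with the topic distributions described above.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 3.0       # controls how many topics are switched on overall (arbitrary here)
    num_docs = 10
    rows = []         # one binary row per document: which topics it uses
    dish_counts = []  # how many documents use each topic so far

    for i in range(1, num_docs + 1):
        row = []
        # Reuse an existing topic k with probability (number of previous users) / i.
        for k, m in enumerate(dish_counts):
            use = rng.random() < m / i
            row.append(int(use))
            dish_counts[k] += int(use)
        # Switch on a Poisson(alpha / i) number of brand-new topics.
        new = rng.poisson(alpha / i)
        row.extend([1] * new)
        dish_counts.extend([1] * new)
        rows.append(row)

    # Pad rows to a common width to display the binary document-topic mask.
    K = len(dish_counts)
    Z = np.array([r + [0] * (K - len(r)) for r in rows])
    print(Z)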

We are currently exploring how this new type of topic model can be integrated into an Idea Management System (IMS), which can be viewed as a collaborative brainstorming system. In its simplest form, an IMS is a so-called suggestion box, where customers and/or employees can submit feedback or make suggestions for product/service improvements. Large companies such as IBM, Dell, Microsoft, Whirlpool, UBS or Starbucks have deployed such systems to better support innovation, with the aim of capturing the collective wisdom residing in their employee and/or customer base. Xerox is adapting IMS to other domains, such as urban planning and policy design, facilitating the communication between citizens and political decision makers through an IMS with advanced filtering, browsing and aggregation capabilities.

However, when a large number of ideas are collected, it quickly becomes very time-consuming to identify common themes, as well as overlaps, duplicates and related ideas. The system we are developing aims to facilitate this process for the decision maker by providing him or her with tools to explore, curate and aggregate ideas. Probabilistic topic models with power-law characteristics are very useful in this context, as they enable users and curators to find more relevant and targeted topics, increasing the relevance of retrieved documents and improving their browsing experience.

 About the author: Cédric Archambeau is Area Manager of the Machine Learning group at Xerox Research Centre Europe. He also holds an Honorary Senior Research Associate position in the Centre for Computational Statistics and Machine Learning at University College London. His research interests include probabilistic machine learning and data science, with applications in natural language processing, relational learning, personalised content creation and data assimilation.

[1] D. M. Blei, A. Y. Ng, M. I. Jordan: Latent Dirichlet allocation. Journal of Machine Learning Research, 3(4–5):993–1022, 2003.
[2] T. L. Griffiths, M. Steyvers: Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
[3] Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei: Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[4] C. Archambeau, B. Lakshminarayanan, G. Bouchard: Latent IBP compound Dirichlet Allocation. To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence.
[5] Z. Ghahramani, T. Griffiths, P. Sollich: Bayesian nonparametric latent feature models (with discussion). Bayesian Statistics, 8:201–226, 2007.