Summarizing big data using submodular functions

21st April 2016

Baharan Mirzasoleiman, doctoral candidate at the Swiss Federal Institute of Technology (ETH), Zürich, Switzerland


Many large-scale machine learning problems (clustering, non-parametric learning, kernel machines, etc.) require selecting a small yet representative subset from a large dataset. Such problems can often be reduced to maximizing a submodular set function subject to various constraints. Submodular functions exhibit a natural diminishing-returns property: the marginal benefit of any given element decreases as more and more elements are selected. Although maximizing a submodular function is NP-hard in general, a simple greedy algorithm produces solutions competitive with the (intractable) optimal solution. However, this greedy algorithm is impractical for truly large-scale problems where the data resides on disk or arrives and changes rapidly over time. In this talk, we consider the problem of submodular function maximization in distributed and streaming settings. We briefly overview the theoretical results, as well as the effectiveness of these techniques in several real-world applications involving millions of data points.
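To make the greedy algorithm mentioned above concrete, here is a minimal sketch in Python. It uses a set-coverage objective, a standard textbook example of a monotone submodular function; the function names and the example data are illustrative choices, not taken from the talk.

```python
def coverage(chosen):
    """Submodular objective: number of distinct elements covered by the chosen sets."""
    covered = set()
    for s in chosen:
        covered |= s
    return len(covered)

def greedy_max(candidates, k):
    """Greedily pick k sets, each time adding the one with the largest marginal gain.

    For monotone submodular functions, this greedy rule is guaranteed to achieve
    at least a (1 - 1/e) fraction of the optimal value.
    """
    chosen = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        # Marginal gain of s given the current selection: f(chosen + s) - f(chosen)
        best = max(remaining, key=lambda s: coverage(chosen + [s]) - coverage(chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy example: pick 2 of 4 candidate sets to cover as many elements as possible.
sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
solution = greedy_max(sets, k=2)
print(coverage(solution))  # greedy picks {4,5,6,7} then {1,2,3}, covering 7 elements
```

Each iteration scans all remaining candidates to find the largest marginal gain, which is exactly the per-element function evaluation that becomes a bottleneck at scale and motivates the distributed and streaming variants the talk covers.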