Bias and variance of learning with weighted automata

19th November 2015

Lancaster University

Speaker: Borja Balle, Professor at Lancaster University, Lancaster, U.K.

Abstract: Weighted automata (WA) are a general class of finite state machines that includes many widely used models, such as hidden Markov models, deterministic finite automata, and predictive state representations. Traditional algorithms for learning WA were developed for particular problems and sub-classes of models (e.g., EM for HMM and state-merging for deterministic automata). In recent years, the introduction of spectral techniques has provided a new standard tool for designing efficient algorithms for learning WA that can be applied in many different settings (e.g., stochastic WFA, transductions, regression on strings). The most salient features of the spectral approach are polynomial running time without local optima, PAC guarantees in the realizable setting, and very good performance in practice. The last two items suggest that spectral learning is a tool to consider in many practical applications, but they also show that there is still a gap between the theory and practice of spectral learning: in general, there is no theoretical explanation for the good behavior of spectral learning on real data.
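As a rough illustration of the spectral approach described above, the sketch below recovers a one-state WA from finite sub-blocks of a Hankel matrix via a truncated SVD. The target function, the prefix/suffix sets, and the model size are synthetic choices made for illustration only; they are not taken from the talk.

```python
import numpy as np

# Hypothetical sketch of spectral learning of a weighted automaton (WA) from
# Hankel sub-blocks, assuming a one-letter alphabet {a} and prefix/suffix
# sets {eps, a}. The target f(a^n) = 0.5**n has rank 1, so one state suffices.
f = lambda n: 0.5 ** n

# Hankel block H[u, v] = f(uv) and shifted block H_a[u, v] = f(u a v).
H   = np.array([[f(0), f(1)], [f(1), f(2)]])
H_a = np.array([[f(1), f(2)], [f(2), f(3)]])

# Rank factorization H = P S via truncated SVD.
U, s, Vt = np.linalg.svd(H)
k = 1                                 # number of states in the hypothesis WA
P = U[:, :k] * s[:k]                  # forward map,  shape (|prefixes|, k)
S = Vt[:k, :]                         # backward map, shape (k, |suffixes|)

# Recover WA parameters: alpha (initial), A_a (transition), beta (final).
pinvP = np.linalg.pinv(P)
pinvS = np.linalg.pinv(S)
A_a   = pinvP @ H_a @ pinvS           # transition operator for symbol a
alpha = H[0, :] @ pinvS               # row of H for the empty prefix
beta  = pinvP @ H[:, 0]               # column of H for the empty suffix

# The learned WA computes f_hat(a^n) = alpha @ A_a^n @ beta.
f_hat = lambda n: float(alpha @ np.linalg.matrix_power(A_a, n) @ beta)
print(round(f_hat(3), 6))             # 0.125, matching f(3)
```

Since the Hankel block here has exactly rank 1, the truncated factorization is exact and the learned WA reproduces f; with noisy or higher-rank data, the choice of k introduces the bias that the talk's SVA analysis addresses.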

The purpose of this talk is to present two recent lines of work that shed some light on the problem of spectral learning without realizability assumptions, and to discuss how these can lead to better algorithms by informing regularization and model selection for WA. Our first set of results looks into how the number of states of the hypothesis WA affects the learning bias. These investigations led us to develop a novel canonical form for WA that we call the Singular Value Automaton (SVA), which can be interpreted as a compressed representation of the SVD of an infinite Hankel matrix. We will explain how SVA truncation is related to the bias of spectral learning and describe the potential learning applications of efficient algorithms for computing the SVA of a given WA. If time permits, a recent extension of SVA to weighted tree automata will also be discussed. In the second part of the talk I will describe three families of regularizers for learning with WA and provide bounds on the Rademacher complexity of classes of WA defined in terms of these regularizers. These results can be used to understand the different sources of variance when learning WA, and they provide principled ways of controlling the capacity of hypothesis classes based on WA. In particular, our approach provides a theoretical justification for regularizing Hankel matrix completion problems with nuclear norm penalties.
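The nuclear-norm penalty mentioned at the end can be sketched through its proximal operator, singular-value soft-thresholding, applied to a toy Hankel completion problem. The observed mask, the penalty parameter tau, and the iteration count below are illustrative assumptions, not details from the talk.

```python
import numpy as np

def svt(M, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Toy Hankel block of the rank-1 function f(a^n) = 0.5**n, with one entry missing.
f = lambda n: 0.5 ** n
H_true = np.array([[f(0), f(1), f(2)],
                   [f(1), f(2), f(3)],
                   [f(2), f(3), f(4)]])
mask = np.ones_like(H_true, dtype=bool)
mask[2, 2] = False                       # pretend f(a^4) was never observed

# Proximal iteration: shrink singular values, then re-impose observed entries.
X = np.where(mask, H_true, 0.0)
for _ in range(300):
    X = svt(X, 1e-3)
    X[mask] = H_true[mask]
```

The low-rank pull of the nuclear norm fills in the unobserved entry with a value close to the one consistent with a rank-1 Hankel matrix, which is the kind of behavior the talk's Rademacher complexity bounds justify.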

This talk is based on joint work with: Shay Cohen, Mehryar Mohri, Prakash Panangaden, Doina Precup, and Guillaume Rabusseau.