
Robust parsing provides mechanisms to identify the major syntactic structures and the major functional relations between words in large collections of unrestricted documents (e.g. web pages, newspapers, scientific literature, encyclopedias). Xerox Incremental Parsing (XIP) is a formalism that smoothly integrates a number of description mechanisms for shallow and deep robust parsing, ranging from part-of-speech disambiguation, entity recognition and chunking to dependency grammars and extra-sentential processing. XIP grammars have been developed for a number of languages, including French and English; grammars for other languages (Japanese, Chinese, German, Czech) are being developed outside Xerox. Major applications include contextual entity recognition, lexical and structural disambiguation, coreference resolution and, more broadly, knowledge extraction.
Xerox robust parsers are based on a specific methodology, called incremental parsing, initially implemented in a finite-state framework (Incremental Finite-State Parsing or IFSP [Aït-Mokhtar & Chanod 1997, Gala 1999]). Recent developments include a new parsing framework, XIP (Xerox Incremental Parsing) [Aït-Mokhtar, Chanod and Roux 2001, 2002]. XIP retains most of the IFSP methodology, in particular the incremental organization of the rules and the double output in the form of annotated chunks and sets of dependencies. However, the underlying XIP formalism and the XIP parsing engine are entirely new. This leads to improved computational performance and increased expressive power, allowing finer-grained linguistic phenomena to be described more efficiently. Indeed, XIP has been designed to build robust analysers that tackle deeper linguistic aspects than those traditionally handled by the now widespread shallow parsers.
The rule formalism has been designed specifically for deep parsing and can handle rich and fine-grained lexical and dependency descriptions via feature lists. Linguistic descriptions are organized in ordered modules, depending on their level of depth. Modularity facilitates the maintenance of linguistic data and makes the system easily customizable or reusable.
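The idea of ordered modules, each enriching the analysis produced by the previous one, can be illustrated with a small Python sketch. This is not XIP rule syntax and not the XIP engine: the names `Token`, `Analysis`, `tag`, `chunk`, `extract_dependencies` and the toy lexicon are all hypothetical, chosen only to show feature lists on words, an ordered list of modules, and the double output of chunks and dependencies.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    form: str
    features: dict = field(default_factory=dict)  # feature list attached to a word

@dataclass
class Analysis:
    tokens: list
    chunks: list = field(default_factory=list)        # (label, start, end) spans
    dependencies: list = field(default_factory=list)  # (name, head_idx, dependent_idx)

# Toy lexicon (an assumption made for this illustration only).
LEXICON = {"the": "DET", "dog": "NOUN", "cat": "NOUN", "chases": "VERB"}

def tag(analysis):
    """Module 1: attach a part-of-speech feature to every token."""
    for tok in analysis.tokens:
        tok.features["pos"] = LEXICON.get(tok.form.lower(), "UNK")
    return analysis

def chunk(analysis):
    """Module 2: group DET+NOUN or a bare NOUN into an NP chunk."""
    toks, i = analysis.tokens, 0
    while i < len(toks):
        pos = toks[i].features.get("pos")
        next_pos = toks[i + 1].features.get("pos") if i + 1 < len(toks) else None
        if pos == "DET" and next_pos == "NOUN":
            analysis.chunks.append(("NP", i, i + 1))
            i += 2
        else:
            if pos == "NOUN":
                analysis.chunks.append(("NP", i, i))
            i += 1
    return analysis

def extract_dependencies(analysis):
    """Module 3: derive SUBJ/OBJ relations from chunks adjacent to a verb."""
    toks = analysis.tokens
    for _label, start, end in analysis.chunks:
        if end + 1 < len(toks) and toks[end + 1].features.get("pos") == "VERB":
            analysis.dependencies.append(("SUBJ", end + 1, end))
        if start >= 1 and toks[start - 1].features.get("pos") == "VERB":
            analysis.dependencies.append(("OBJ", start - 1, end))
    return analysis

# The order of the modules reflects increasing depth of analysis.
MODULES = [tag, chunk, extract_dependencies]

def parse(text, modules=MODULES):
    analysis = Analysis([Token(word) for word in text.split()])
    for module in modules:
        analysis = module(analysis)
    return analysis

result = parse("The dog chases the cat")
print(result.chunks)
print(result.dependencies)
```

Because each module only reads and extends the shared analysis, a module can be replaced or dropped (for example, calling `parse(text, modules=[tag, chunk])` to stop at chunking) without touching the others, which is the kind of maintainability the modular organization is meant to give.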
The rule formalism allows one to recognize n-ary linguistic relations between words or constituents as a consequence of global or local structural, topological and/or lexical conditions. It accepts various types of input, ranging from raw text to chunked or constituent-marked text, which makes it possible to exploit existing annotated corpora or to use the system as a deep-analysis layer on top of existing shallow parsers. It has been successfully used, in a modular way, to build deep functional dependency grammars and to perform coreference resolution.
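As a rough illustration of deriving an n-ary relation from already chunked input, the Python sketch below reads a bracketed chunk notation and emits a ternary relation when a structural condition (an NP VP NP sequence) and an optional lexical condition (the verb belongs to a given set) both hold. The bracket format, the relation name `SVO`, the "last word is the head" heuristic and the function names are assumptions made for this example; they do not reproduce XIP's input formats or rule language.

```python
import re

# Hypothetical chunk notation: "[NP the dog] [VP chases] [NP the cat]"
CHUNK_RE = re.compile(r"\[(\w+) ([^\]]+)\]")

def read_chunked(text):
    """Turn bracketed chunked text into a list of (label, words) pairs."""
    return [(label, words.split()) for label, words in CHUNK_RE.findall(text)]

def head(words):
    """Toy heuristic: treat the last word of a chunk as its head."""
    return words[-1]

def svo_relations(chunks, verbs=None):
    """Emit a ternary relation (name, verb, subject, object) for NP VP NP.

    structural condition: three adjacent chunks labelled NP, VP, NP
    lexical condition:    the verb head is in `verbs` (skipped if verbs is None)
    """
    relations = []
    for (l1, w1), (l2, w2), (l3, w3) in zip(chunks, chunks[1:], chunks[2:]):
        structural = (l1, l2, l3) == ("NP", "VP", "NP")
        lexical = verbs is None or head(w2) in verbs
        if structural and lexical:
            relations.append(("SVO", head(w2), head(w1), head(w3)))
    return relations

chunks = read_chunked("[NP the dog] [VP chases] [NP the cat]")
print(svo_relations(chunks))
```

Since the rule works on chunk labels rather than on raw text, the same function could be applied to the output of any shallow parser that can be converted to `(label, words)` pairs.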
More details about the formalism are available in the XIP user's guide and XIP reference guide. A XIP tutorial is also available (70 slides). (Contact: F. Segond)