PARSING & SEMANTICS
ParSem concentrates on automatically making sense of electronic documents by analyzing them semantically. It pursues two main research lines in natural language processing: robust parsing and semantics.
Robust parsing provides mechanisms for identifying major syntactic structures and major functional relations between words in large collections of unrestricted documents (e.g. Web pages, newspapers, scientific literature, encyclopedias). Xerox Incremental Parsing (XIP) is a formalism that smoothly integrates a number of description mechanisms for shallow and deep robust parsing, ranging from part-of-speech disambiguation, entity recognition and chunking to dependency grammars and extra-sentential processing. XIP grammars have been developed for a number of languages, including French and English, and grammars for other languages are being developed outside Xerox (Japanese, Chinese, German, Czech). Major applications include contextual entity recognition, lexical and structural disambiguation, coreference resolution and, more globally, knowledge extraction.
DESCRIPTION
Xerox robust parsers are based on a specific methodology, called incremental parsing, initially implemented in a finite-state framework (Incremental Finite-State Parsing or IFSP [Aït-Mokhtar & Chanod 1997, Gala 1999]). Recent developments include a new parsing framework, XIP (Xerox Incremental Parsing) [Aït-Mokhtar, Chanod and Roux 2001, 2002]. XIP retains most of the IFSP methodology (esp. the incremental organization of the rules and the dual output in the form of annotated chunks and sets of dependencies). However, the underlying XIP formalism and the XIP parsing engine are totally new. This leads to improved computational performance and increased expressive power, allowing one to describe more fine-grained linguistic phenomena in a more efficient way.
Indeed, XIP is designed for building robust analysers that tackle deeper linguistic aspects than those traditionally handled by the now widespread shallow parsers. The rule formalism has been designed specifically for deep parsing and can handle rich and fine-grained lexical and dependency descriptions via feature lists. Linguistic descriptions are organized in ordered modules, depending on their depth level. Modularity facilitates the maintenance of linguistic data and makes the system easily customizable or reusable.
The rule formalism allows one to recognize n-ary linguistic relations between words or constituents as a consequence of global or local structural, topological and/or lexical conditions. It accepts various types of input, ranging from raw to chunked or constituent-marked texts, which makes it possible to exploit existing annotated corpora or to use the system as a front-end deep analyser for existing shallow parsers. It has been successfully used, in a modular way, to build deep functional dependency grammars and to perform coreference resolution.
More details about the formalism are available in the XIP user's guide
and XIP reference guide. A XIP tutorial is also available (70 slides).
(Contact: F. Segond)
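As a rough illustration of the incremental organization described above (ordered modules, each building on the output of the previous one, with annotated chunks and sets of dependencies as the two outputs), here is a minimal Python sketch. It does not use the XIP formalism or the XIP engine: the toy lexicon, the layer functions `tag`, `chunk` and `link`, and the SUBJ/OBJ rules are all hypothetical simplifications chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical toy lexicon (word -> part of speech); not XIP data.
LEXICON = {
    "the": "DET", "a": "DET",
    "parser": "NOUN", "document": "NOUN", "documents": "NOUN",
    "analyzes": "VERB", "reads": "VERB",
    "large": "ADJ",
}

@dataclass
class Analysis:
    tokens: list                                       # (word, pos) pairs
    chunks: list = field(default_factory=list)         # (label, start, end)
    dependencies: list = field(default_factory=list)   # (relation, head, dependent)

def tag(analysis):
    """Layer 1: assign a part of speech to each word (UNK if unknown)."""
    analysis.tokens = [(w, LEXICON.get(w.lower(), "UNK")) for w, _ in analysis.tokens]
    return analysis

def chunk(analysis):
    """Layer 2: group DET? ADJ* NOUN into NP chunks and single verbs into VP chunks."""
    toks = analysis.tokens
    i = 0
    while i < len(toks):
        j = i
        if toks[j][1] == "DET":
            j += 1
        while j < len(toks) and toks[j][1] == "ADJ":
            j += 1
        if j < len(toks) and toks[j][1] == "NOUN":
            analysis.chunks.append(("NP", i, j + 1))
            i = j + 1
        elif toks[i][1] == "VERB":
            analysis.chunks.append(("VP", i, i + 1))
            i += 1
        else:
            i += 1  # robustness: skip material that no rule covers
    return analysis

def link(analysis):
    """Layer 3: derive SUBJ/OBJ dependencies from the chunks built by layer 2."""
    words = [w for w, _ in analysis.tokens]
    for k, (label, start, _end) in enumerate(analysis.chunks):
        if label != "VP":
            continue
        verb = words[start]
        left = [c for c in analysis.chunks[:k] if c[0] == "NP"]
        right = [c for c in analysis.chunks[k + 1:] if c[0] == "NP"]
        if left:   # nearest NP to the left; its last word is taken as head
            analysis.dependencies.append(("SUBJ", verb, words[left[-1][2] - 1]))
        if right:  # nearest NP to the right
            analysis.dependencies.append(("OBJ", verb, words[right[0][2] - 1]))
    return analysis

# Ordered modules: each layer only relies on what earlier layers produced.
LAYERS = [tag, chunk, link]

def parse(sentence):
    analysis = Analysis(tokens=[(w, None) for w in sentence.split()])
    for layer in LAYERS:
        analysis = layer(analysis)
    return analysis

result = parse("The parser analyzes large documents")
print(result.chunks)        # annotated chunks
print(result.dependencies)  # set of dependencies
```

Keeping the layers in an ordered list mirrors the modularity argument made above: a layer can be replaced or refined without touching the others, and unknown words simply pass through untouched rather than causing the analysis to fail.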
With the goal of transforming documents into “meaningful spaces”, the main focus has to be semantics. Semantics is everywhere, hidden in completely different types of documents (e.g. text, images, videos, programs and audio) and at different levels (e.g. document content, document structure). Because most of the “semantics” that is accessible in documents today lies in text, we concentrate on the semantic content analysis of the textual parts of documents. This textual part also includes document structure (for instance, information already encoded in tags and user profiles). Our goal is not to investigate the fundamental nature of meaning, so we concentrate on linguistic meaning.
DESCRIPTION
A unifying theme in the ongoing research in the ParSem area is an emphasis on the role of context in determining meaning. We are particularly interested in theoretical models of communication, language, dialogue, computation, and inference which take into account the context in which these activities are occurring. We are also interested in applying research results to practical applications and real-world problems. Our general application focus is information discovery. Our current research themes include:
• We build tools that can discover that two concepts are related by noticing that expressions denoting those concepts are frequently linked together syntactically in a corpus. We explore the idea that the range of syntactic constructions that can link two concepts may provide information about the nature of the relationship(s) that can exist between them. This information could subsequently be used to enrich the representation of a document's content with entities and relations that are implied but not explicitly stated.
• Word sense disambiguation (WSD) starts with the determination of all the different senses of every word relevant, at least, to the text or discourse under consideration. The precise definition of a sense is a matter of debate, but many recent approaches rely on predefined senses, such as the list of senses given in a dictionary, associated words, or entries in a transfer dictionary. The assignment of words to senses is done using two sources of information: (1) the linguistic context of the word to be disambiguated (and possibly some extra-linguistic knowledge about the situation, etc.), and (2) external knowledge sources, including lexical and encyclopedic ones. All disambiguation processes involve matching the context of an instance of the word to be disambiguated with either information from an external knowledge source (knowledge-driven WSD) or information about the contexts of previously disambiguated instances of the word derived from corpora (data-driven or corpus-based WSD).
• Domain specific normalization
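To make the knowledge-driven WSD theme above more concrete, the following minimal Python sketch matches the context of an ambiguous word against predefined dictionary senses. It uses a simplified Lesk-style gloss overlap, a well-known textbook technique swapped in here for illustration; it is not claimed to be the method used in ParSem. The sense inventory `SENSES`, its glosses, the stopword list and the function names are all hypothetical.

```python
import re

# Hypothetical mini sense inventory; glosses are illustrative, not from a real dictionary.
SENSES = {
    "bank": {
        "bank/finance": "institution that accepts deposits and lends money to customers",
        "bank/river": "sloping land beside a river or a stream of water",
    }
}

STOPWORDS = {"a", "an", "and", "at", "by", "for", "in", "of", "on", "or", "that", "the", "to"}

def content_words(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(word, context):
    """Return the predefined sense whose gloss shares the most words with the context."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the first sense listed
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "She sat on the bank of the river and watched the water"))
# -> bank/river
print(disambiguate("bank", "He went to the bank to deposit money"))
# -> bank/finance
```

A data-driven variant would replace the glosses with word counts collected from the contexts of previously disambiguated instances, while the matching step would stay essentially the same.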
PEOPLE