   

PARSING & SEMANTICS

ParSem concentrates on automatically making sense of electronic documents by analyzing them semantically. It pursues two main lines of natural language processing research: robust parsing and semantics.


 

ROBUST PARSING

Robust parsing provides mechanisms for identifying major syntactic structures and major functional relations between words in large collections of unrestricted documents (e.g. Web pages, newspapers, scientific literature, encyclopedias). Xerox Incremental Parsing (XIP) is a formalism that smoothly integrates a number of description mechanisms for shallow and deep robust parsing, ranging from part-of-speech disambiguation, entity recognition and chunking to dependency grammars and extra-sentential processing. XIP grammars have been developed for a number of languages, including French and English; grammars for other languages (Japanese, Chinese, German, Czech) are being developed outside Xerox. Major applications include contextual entity recognition, lexical and structural disambiguation, coreference resolution and, more globally, knowledge extraction.
 
 

DESCRIPTION

Xerox robust parsers are based on a specific methodology, called incremental parsing, initially implemented in a finite-state framework (Incremental Finite-State Parsing or IFSP [Aït-Mokhtar & Chanod 1997, Gala 1999]). Recent developments include a totally new parsing framework, XIP (Xerox Incremental Parsing) [Aït-Mokhtar, Chanod and Roux 2001, 2002]. XIP retains most of the IFSP methodology (especially the incremental organization of the rules and the double output in the form of annotated chunks and sets of dependencies). However, the underlying XIP formalism and the XIP parsing engine are totally new. This leads to improved computational performance and increased expressive power, allowing one to describe more fine-grained linguistic phenomena in a more efficient way. Indeed, XIP is designed for building robust analysers that tackle deeper linguistic aspects than those traditionally handled by the now widespread shallow parsers.

The rule formalism has been designed specifically for deep parsing and can handle rich and fine-grained lexical and dependency descriptions via feature lists. Linguistic descriptions are organized in ordered modules, depending on their depth level. Modularity facilitates the maintenance of linguistic data and makes the system easily customizable or reusable.

The rule formalism allows one to recognize n-ary linguistic relations between words or constituents as a consequence of global or local structural, topological and/or lexical conditions. It offers the advantage of accepting various types of inputs, ranging from raw to chunked or constituent-marked texts, which makes possible the exploitation of existing annotated corpora or the use of the system as a front-end deep analyser for existing shallow parsers. It has been successfully used to build deep functional dependency grammars, as well as for the task of coreference resolution, in a modular way.
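The incremental organization described above (ordered modules, with chunks produced first and dependencies computed on top of them) can be illustrated with a small sketch. This is not XIP rule syntax and does not reproduce the XIP engine: the tag set, the chunk labels, the SUBJ/OBJ relation names and the functions `chunk`, `dependencies` and `parse` are all assumptions made for illustration, and the input is assumed to be already part-of-speech tagged.

```python
from typing import List, Tuple

Token = Tuple[str, str]              # (word, part-of-speech tag), hypothetical tag set
Chunk = Tuple[str, List[Token]]      # (chunk label, tokens in the chunk)
Dependency = Tuple[str, str, str]    # (relation, head word, dependent word)

NP_TAGS = {"DET", "ADJ", "NOUN"}


def chunk(tokens: List[Token]) -> List[Chunk]:
    """Layer 1 (shallow): group tokens into flat NP and VP chunks."""
    chunks: List[Chunk] = []
    i = 0
    while i < len(tokens):
        tag = tokens[i][1]
        if tag in NP_TAGS:
            # Extend the NP up to and including the first noun.
            j = i
            while j < len(tokens) and tokens[j][1] in NP_TAGS:
                j += 1
                if tokens[j - 1][1] == "NOUN":
                    break
            chunks.append(("NP", tokens[i:j]))
            i = j
        elif tag == "VERB":
            chunks.append(("VP", [tokens[i]]))
            i += 1
        else:
            chunks.append(("O", [tokens[i]]))  # outside any chunk
            i += 1
    return chunks


def dependencies(chunks: List[Chunk]) -> List[Dependency]:
    """Layer 2 (deeper): derive relations from the chunk sequence."""
    deps: List[Dependency] = []
    for k, (label, toks) in enumerate(chunks):
        if label != "VP":
            continue
        verb = toks[0][0]
        # Nearest NP to the left is taken as subject; its last token is the head.
        for left_label, left_toks in reversed(chunks[:k]):
            if left_label == "NP":
                deps.append(("SUBJ", verb, left_toks[-1][0]))
                break
        # Nearest NP to the right is taken as object.
        for right_label, right_toks in chunks[k + 1:]:
            if right_label == "NP":
                deps.append(("OBJ", verb, right_toks[-1][0]))
                break
    return deps


def parse(tokens: List[Token]) -> Tuple[List[Chunk], List[Dependency]]:
    """Run the layers in order and return both outputs: chunks and dependencies."""
    chunks = chunk(tokens)
    return chunks, dependencies(chunks)


sentence = [("The", "DET"), ("parser", "NOUN"), ("builds", "VERB"),
            ("a", "DET"), ("robust", "ADJ"), ("analysis", "NOUN"), (".", "PUNCT")]
chunks, deps = parse(sentence)
```

Because the dependency layer only reads the output of the chunking layer, either layer can be replaced or fed with externally chunked input, which is the kind of modularity the text attributes to the formalism.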

More details about the formalism are available in the XIP user's guide and XIP reference guide. A XIP tutorial is also available (70 slides). (Contact: F. Segond)

XIP demos.

 

 

SEMANTICS

With the goal of transforming documents into “meaningful spaces”, the main focus has to be semantics. Semantics is everywhere, hidden in completely different types of documents (e.g. text, images, videos, programs and audio) and at different levels (e.g. document content, document structure). Because most of the “semantics” that is nowadays accessible in documents lies in texts, we concentrate on the semantic content analysis of the textual parts of documents. This textual part also includes document structure (for instance information already encoded into tags and user profiles). Our goal is not to investigate the fundamental nature of meaning, so we concentrate on the linguistic meaning. 


DESCRIPTION

A unifying theme in the ongoing research in the ParSem area is an emphasis on the role of context  in determining meaning. We are particularly interested in theoretical models of communication, language, dialogue, computation, and inference which take into account the context in which these activities are occurring. 
We are also interested in applying research results to practical applications and real-world problems.  Our general application focus is information discovery.

Our current research themes include:
 

  •  Ontology Acquisition: The word “ontology” traditionally refers in philosophy to the description of the universe. In computational linguistics, the word applies to the description of knowledge. An ontology in that sense is defined as a set of concepts and a set of relations, where each concept is described relative to the other concepts through one or more relations.

     We build tools that can discover that two concepts are related, by noticing that expressions denoting those concepts are frequently linked together syntactically in a corpus. We explore the idea that the range of syntactic constructions that can be used to link two concepts may provide information about the nature of the relationship(s) that can exist between those concepts. This information could subsequently be used to enrich the representation of a document's content with entities and relations that are implied, but not explicitly stated.
     
     
  •  Semantic Disambiguation: Word sense disambiguation (WSD) aims at associating a given word in text or discourse with a definition, meaning or semantic class (sense) that is distinguishable from other meanings potentially attributable to that word. This task involves two steps:

      • The determination of all the different senses of every word relevant to the text or discourse under consideration. The precise definition of what a sense is remains a matter of debate, but many recent approaches rely on predefined senses such as the list of senses given in a dictionary, associated words, entries in a transfer dictionary, etc.
      • The assignment of words to senses, which is done using two sources of information:
          • The linguistic context of the word to be disambiguated (and possibly some extra-linguistic knowledge about the situation, etc.)
          • External knowledge sources (lexical, encyclopedic, etc.)

    All disambiguation processes involve matching the context of an instance of the word to be disambiguated either with information from an external knowledge source (knowledge-driven WSD) or with information about the contexts of previously disambiguated instances of the word derived from corpora (data-driven or corpus-based WSD).
    Two major types of techniques are emerging in WSD: statistical supervised systems and unsupervised knowledge-based systems. In the last Senseval/Romanseval competition, however, it was noted that several unsupervised systems made use of the training data to fine-tune their results, and that several supervised systems used a lexical resource as a fallback where the data were insufficient. A combination of methodologies therefore seems to be the trend for the future of WSD.
     

  •  Linguistic Normalization: Taking a syntactic description of an input text as its basis, normalization aims to provide a more abstract representation of that text, as a step towards semantic representation. Current work on normalization in the ParSem area takes two perspectives:

      • General normalization
      • Domain specific normalization
       
  •  Co-reference: The coreference resolution task aims at establishing equivalence between entities that are mentioned in a text. The first phase of the project deals with pronominal coreference. It mainly focuses on personal pronouns (I, he/she, they, ...), which were shown to be the most frequent in texts.

  •  Discourse Analysis: In this project we explore representations that facilitate the recognition of non-lexicalized, non-conventional expressions for a given concept.
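To make the knowledge-driven WSD idea from the list above concrete, here is a minimal sketch in the spirit of the simplified Lesk approach: each candidate sense is scored by how many content words its dictionary gloss shares with the context of the ambiguous word. It is not the ParSem system. The sense inventory, the sense labels, the stopword list and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Set

# Hypothetical, tiny stopword list; a real system would use a fuller resource.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "or", "and",
             "that", "for", "is"}

# Hypothetical sense inventory: word -> {sense label: dictionary-style gloss}.
INVENTORY: Dict[str, Dict[str, str]] = {
    "bank": {
        "bank/finance": "an institution that accepts deposits of money and makes loans",
        "bank/river": "the sloping land alongside a river or a stream of water",
    }
}


def content_words(text: str) -> Set[str]:
    """Lowercase the text, strip basic punctuation and drop stopwords."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(word: str, context: str,
                 inventory: Dict[str, Dict[str, str]]) -> str:
    """Return the sense whose gloss overlaps most with the context.

    Ties, including the case of no overlap at all, go to the sense listed
    first, which acts as a crude default.
    """
    ctx = content_words(context) - {word.lower()}
    best_sense, best_overlap = "", -1
    for sense, gloss in inventory[word].items():
        overlap = len(ctx & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


financial = disambiguate("bank", "She deposited the money at the bank", INVENTORY)
riverside = disambiguate("bank", "They fished from the bank of the river", INVENTORY)
```

A data-driven variant would replace the glosses with context words collected from previously disambiguated corpus instances, and the two sources could be combined, in line with the trend noted for Senseval/Romanseval.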

 

ONGOING ACTIVITIES

  • Pre-processing components
  • XIP engine
  • Grammar development
  • Entity recognition
  • Ontology Acquisition
  • Semantic Disambiguation
  • Linguistic Normalization
  • Co-reference
  • Discourse Analysis
  • VIKEF: contact: S. Aït-Mokhtar
General contact for Parsing and Semantics activities: parsem@xrce.xerox.com
 
 
 

PUBLICATIONS

     

PEOPLE