CA : CONTENT-ANALYSIS

OVERVIEW :

With the proliferation of on-line document repositories and the phenomenal growth of the Web, a vast amount of information is available at our fingertips. The central problem becomes that of quickly finding, within that mass, the specific pieces of information that are needed at any given time. As a large proportion of the data is made up of natural language texts, any comprehensive solution will rely heavily on natural language processing (NLP).

Our research agenda concerns theories, methods, tools and systems that make it possible to uncover the content of natural language texts. This includes:

  • assessing overall document relevance with respect to a given information need (information retrieval, filtering, document categorization and clustering);

  • detailed text understanding, which relates specific text segments (words, sentences, etc.) to the situations they describe, e.g. situations in which certain entities (such as people, companies, stocks, genes) participate in certain relationships (such as "company A acquired company B" or "gene X interacts with protein Y");

  • document content modelling for the purpose of capturing the relationship between high-level communicative goals and the surface form of documents, in particular for specialized sublanguages.
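As an illustration of the first item above (assessing document relevance with respect to an information need), here is a minimal sketch of TF-IDF ranking with cosine similarity. It is a generic textbook technique, not a description of our own systems; the function names `tokenize` and `rank`, the whitespace tokenizer and the smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for real tokenization.
    return text.lower().split()

def rank(query, docs):
    """Return document indices ordered by TF-IDF cosine similarity to the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (one common variant among several).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vector(tokens):
        tf = Counter(tokens)
        # Terms unseen in the collection carry no weight.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vector(tokenize(query))
    scores = [cosine(q, vector(toks)) for toks in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

docs = [
    "gene X interacts with protein Y",
    "company A acquired company B",
    "stocks fell sharply today",
]
best = rank("company acquired", docs)[0]
print(docs[best])
```

Only the second document shares terms with the query, so it is ranked first. Categorization and clustering can reuse the same vector representation with a different decision rule.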

 

RESEARCH THEMES :

Our research agenda is spelled out in terms of convergent contributions from different core competences. They comprise:

Finite State Technology (FST)

The Finite State Technology research concentrates on tools for specifying and manipulating finite state automata (acceptors and transducers). Our tools (xfst, twolc, lexc) are built on top of a software library that provides algorithms for creating automata from regular expressions. The library implements both classical operations, such as union and composition, and newer algorithms, such as replacement and local sequentialisation. Over the years, the products of our research have come to be used all over the world in many linguistic applications, such as morphological analysis, tokenization and shallow parsing, for a wide variety of natural languages. The xfst tool has been licensed to over 70 universities world-wide. Many components have already been incorporated into commercial software.
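To make the notion of transducer composition concrete, the following is a minimal sketch of the classical product construction for epsilon-free transducers. It does not use or reproduce the xfst, twolc or lexc interfaces; the `FST` tuple, the helper `single_state` and the functions `compose` and `apply` are hypothetical names introduced only for this illustration.

```python
from collections import namedtuple

# Hypothetical representation: arcs map (state, input symbol)
# to a list of (output symbol, next state) pairs. No epsilon arcs.
FST = namedtuple("FST", ["start", "finals", "arcs"])

def single_state(mapping):
    """Build a one-state transducer that rewrites symbols one by one."""
    return FST(0, {0}, {(0, i): [(o, 0)] for i, o in mapping.items()})

def compose(a, b):
    """Product construction: each output symbol of `a` is fed to `b` as input."""
    start = (a.start, b.start)
    arcs, seen, stack = {}, {start}, [start]
    while stack:
        p, q = stack.pop()
        for (src, sym_in), targets in a.arcs.items():
            if src != p:
                continue
            for mid, p2 in targets:
                for out, q2 in b.arcs.get((q, mid), []):
                    nxt = (p2, q2)
                    arcs.setdefault(((p, q), sym_in), []).append((out, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    finals = {s for s in seen if s[0] in a.finals and s[1] in b.finals}
    return FST(start, finals, arcs)

def apply(fst, text):
    """Return all output strings that the transducer relates to `text`."""
    configs = [(fst.start, "")]
    for ch in text:
        configs = [(nxt, out + o)
                   for state, out in configs
                   for o, nxt in fst.arcs.get((state, ch), [])]
    return sorted({out for state, out in configs if state in fst.finals})

lower = single_state({"A": "a", "a": "a", "B": "b", "b": "b"})  # lowercasing
swap = single_state({"a": "x", "b": "b"})                       # rewrite a as x
both = compose(lower, swap)
print(apply(both, "AbBa"))  # ['xbbx']
```

The composed machine performs both rewrites in a single pass over the input, which is the practical benefit of composition: a cascade of rules becomes one transducer. Production libraries additionally handle epsilon transitions, weights and minimization, all of which this sketch leaves out.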

Machine Learning (ML)

The Machine Learning team explores the potential of machine learning techniques for the development of systems that help people access the information they need in large-scale document repositories. Current research focuses on two subgoals: 1) a deep understanding of learning algorithms well adapted to the specific needs of clustering, categorization and retrieval of natural language data; and 2) methods and tools for acquiring new domain-specific linguistic resources (lexicons, thesauri and ontologies).

Robust Parsing 

Robust parsing provides mechanisms for identifying major syntactic structures and major functional relations between words in large collections of unrestricted documents (Web pages, newspapers, encyclopedias). This is a prerequisite for fine-grained linguistic analysis over large collections of texts. XRCE's approach to robust parsing relies on XIP (Xerox Incremental Parser), a high-speed parsing framework for building chunkers, dependency grammars and beyond.

Semantics

Semantic analysis relates linguistic expressions to their non-linguistic content: entities, relations, facts, and events in the world. It is a task that we target for specific events in specific domains (e.g. genomic interactions described in scientific literature or product launches mentioned on company web pages).

Relevant research topics include word sense disambiguation, coreference tracking, paraphrase, ontologies, knowledge representation and inferencing.