CA: CONTENT-ANALYSIS OVERVIEW

With the multiplication of on-line document repositories and the phenomenal growth of the Web, a vast amount of information is available at our fingertips. The central problem becomes that of quickly accessing, within that mass, the arbitrary pieces of information that are needed at any given time. As a large proportion of the data is made up of natural language texts, any comprehensive solution will rely heavily on natural language processing (NLP).

Our research agenda concerns theories, methods, tools and systems that make it possible to uncover the content of natural language texts. This includes:
Our research agenda is spelled out in terms of convergent contributions from four core competences: finite state technology, machine learning, robust parsing and semantic analysis.

The Finite State Technology research concentrates on tools for specifying and manipulating finite state automata (acceptors and transducers). Our tools (xfst, twolc, lexc) are built on top of a software library that provides algorithms for creating automata from regular expressions. The library contains both classical operations, such as union and composition, and newer algorithms, such as replacement and local sequentialisation. Over the years, the products of our research have come to be used all over the world in many linguistic applications, such as morphological analysis, tokenization and shallow parsing of a wide variety of natural languages. The xfst tool has been licensed to over 70 universities world-wide. Many components have already been incorporated into commercial software.

The Machine Learning team explores the potential of machine learning techniques for the development of systems that will help people access the information they need in large-scale document repositories. Current research focuses on two subgoals: 1) deep understanding of learning algorithms well adapted to the specific needs of clustering, categorization and retrieval of natural language data; and 2) methods and tools for acquiring new domain-specific linguistic resources (lexicons, thesauri and ontologies).
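To make the composition of transducers mentioned under Finite State Technology more concrete, here is a minimal Python sketch. It is not the xfst library or its API: the `FST` class, the `compose` function and the toy transducers `upper` and `swap` are all hypothetical names introduced for illustration, and the sketch assumes epsilon-free transducers that read and write exactly one symbol per arc.

```python
class FST:
    """Epsilon-free finite-state transducer (illustrative simplification)."""

    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = set(finals)
        # Each arc is a tuple (source, input_symbol, output_symbol, destination).
        self.arcs = list(arcs)
        self.index = {}
        for src, sym_in, sym_out, dst in self.arcs:
            self.index.setdefault((src, sym_in), []).append((sym_out, dst))

    def apply(self, text):
        """Return the set of output strings related to `text`."""
        frontier = {(self.start, "")}
        for ch in text:
            frontier = {
                (dst, out + sym_out)
                for state, out in frontier
                for sym_out, dst in self.index.get((state, ch), ())
            }
        return {out for state, out in frontier if state in self.finals}


def compose(t1, t2):
    """Product construction: feed the output side of t1 into the input side of t2."""
    arcs = [
        ((s1, s2), a, c, (d1, d2))
        for s1, a, b, d1 in t1.arcs
        for s2, b2, c, d2 in t2.arcs
        if b == b2
    ]
    finals = {(f1, f2) for f1 in t1.finals for f2 in t2.finals}
    return FST((t1.start, t2.start), finals, arcs)


# Toy example (hypothetical data): uppercase a/b, then rewrite A as X.
upper = FST(0, {0}, [(0, "a", "A", 0), (0, "b", "B", 0)])
swap = FST(0, {0}, [(0, "A", "X", 0), (0, "B", "B", 0)])
both = compose(upper, swap)

print(both.apply("abba"))
```

Composing the two machines once and applying the result gives the same mapping as running them in sequence, which is why composition is a convenient way to build cascades of rewriting steps into a single transducer.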
Robust parsing provides mechanisms for identifying major syntactic structures and major functional relations between words in large collections of unrestricted documents (Web pages, newspapers, encyclopedias). This is a prerequisite for fine-grained linguistic analysis over large collections of texts. XRCE's approach to robust parsing relies on XIP (Xerox Incremental Parser), a high-speed parsing framework for building chunkers, dependency grammars and beyond.

Semantic analysis relates linguistic expressions to their non-linguistic content: entities, relations, facts and events in the world. It is a task that we target for specific events in specific domains (e.g. genomic interactions described in scientific literature, or product launches mentioned on companies' web pages). Relevant research topics include word sense disambiguation, coreference tracking, paraphrase, ontologies, knowledge representation and inferencing.