MACHINE LEARNING FOR TEXTUAL INFORMATION ACCESS
We are interested in machine learning as a powerful tool for developing systems that
help people access the information they need in large-scale text repositories. Our current
research focuses on two subgoals:
- deep understanding of learning algorithms well adapted to the specific needs of clustering,
categorization and retrieval of natural language data;
- methods and tools for acquiring new domain-specific linguistic resources (lexicons, thesauri
and ontologies).
DESCRIPTION

Machine Learning (ML) is a general paradigm aiming at the estimation of the parameters of an
unobserved system given observed samples (also called examples). As such, ML can replace and/or
supplement the traditional development of hand-coded rule-based systems. A direct consequence is
that ML can be used to acquire useful lexical information, since such information is usually
obtained through rule application. Thus, ML is often seen as a solution to the data acquisition
bottleneck.
We focus on three main areas at the very heart of textual information access, and in which ML
plays an important role. The first one deals with the automatic acquisition of relevant textual
units and their typing according to a given set of types. The second one deals with the automatic
acquisition of relevant links between these units. Lastly, the third one deals with the general
problem of text categorization and clustering.
- An ML approach to extraction and typing requires the developer/user to provide only
(some) examples of extracted and typed data, while the rule induction task is delegated to
a computer program. Our ML approach combines statistical evaluation and grammatical induction.
Statistical evaluation is used to filter hypotheses, while grammatical induction methods are
mainly used in wrapper generation. We see the latter problem as the inference of
regular/context-free languages/transducers from sample strings or translations.
- Our approach to the problem of linking textual units is based on the assumption that
interesting relations can be acquired from texts (for example, certain patterns are characteristic
of the "is_a" relation). It relies on the combination of linguistic processing, to derive
interesting features, and statistical induction, to infer relations from the extracted features.
On the linguistic side, chunkers and parsers provide the robust and sufficiently precise tools
required. On the learning side, we need statistical methods, such as probabilistic latent
semantic models, that allow the inference of new concepts and of links between them.
- Our current capabilities in document categorization and clustering will be extended in the
direction of the use of Support Vector Machines (SVMs) and other kernel-based techniques. SVMs
search for an optimal hyperplane linearly separating positive examples from negative ones.
This linear separation usually takes place in a high-dimensional space, obtained through a
mapping (via a class of functions called kernel functions) of the input space. Kernel methods
are probably the most promising innovative approach for measuring similarities between documents,
and can deal in a uniform way with the different components of a multimedia document.
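To illustrate how a kernel function can measure similarity between documents, the following minimal Python sketch computes a bag-of-words kernel and a polynomial kernel built on top of it. The function names, the whitespace tokenisation and the default parameters are illustrative assumptions, not a description of our systems. A kernel-based learner such as an SVM would call a function of this kind instead of computing inner products explicitly in the high-dimensional space.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to term counts (naive whitespace tokenisation, an assumption)."""
    return Counter(text.lower().split())


def linear_kernel(doc_a, doc_b):
    """Inner product of the two term-count vectors."""
    a, b = bag_of_words(doc_a), bag_of_words(doc_b)
    return sum(a[term] * b[term] for term in a.keys() & b.keys())


def polynomial_kernel(doc_a, doc_b, degree=2, c=1.0):
    """Polynomial kernel: an inner product in an implicit space of term
    combinations up to the given degree, computed without building that space."""
    return (linear_kernel(doc_a, doc_b) + c) ** degree


def normalized_kernel(kernel, doc_a, doc_b):
    """Scale a kernel value so that identical documents score 1.0."""
    denom = math.sqrt(kernel(doc_a, doc_a) * kernel(doc_b, doc_b))
    return kernel(doc_a, doc_b) / denom if denom else 0.0


if __name__ == "__main__":
    d1 = "the cat sat"
    d2 = "the cat ran"
    print(linear_kernel(d1, d2))
    print(polynomial_kernel(d1, d2))
    print(normalized_kernel(linear_kernel, d1, d2))
```

Because the learner only ever sees kernel values, the same code path could in principle serve other document components (for example image features) by swapping in a different kernel function.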
RECENT EXTERNAL PROJECTS

- KerMIT (Kernel Methods for Image and Text classification, clustering, ranking and filtering) concerns the development of algorithms and software for the classification, clustering, ranking and filtering, both in an online and offline setting, of digital documents.
- MUCHMORE (Multilingual Concept Hierarchies for Medical Information Organisation and
Retrieval): a transatlantic project (NSF/EEC) involving CMU-LTI (USA), Stanford-CSLI (USA),
DFKI (Germany), Zinfo (Germany), EIT (Switzerland) and XRCE (France). This project aims to
develop technologies and a prototype system for multilingual information organization and
access in the medical domain. XRCE will focus on terminology extraction from parallel and
comparable corpora.
- "Alliances": a project funded by the French Ministry of Research and Technologies, involving
LIMSI (Orsay), LIP6 (Paris), la Fondation Charles Léopold Mayer pour le Progrès
de l'Homme (Paris), and XRCE (Grenoble). This project aims to develop tools that help users
navigate a collection of documents and identify agreements and disagreements between authors.
- TransType2 (Computer Assisted Translation): an RTD project funded by the European Commission
under the IST Programme. The aim of TT2 is to develop a Computer-Assisted Translation (CAT)
system that will help solve a very pressing social problem: how to meet the growing demand for
high-quality translation. The innovative solution proposed by TT2 is to embed a data-driven
Machine Translation (MT) engine within an interactive translation environment.