  

 

MACHINE LEARNING FOR TEXTUAL INFORMATION ACCESS

We are interested in machine learning as a powerful tool for developing systems that help people access the information they need in large-scale text repositories. Our current research focuses on two subgoals:

  • deep understanding of learning algorithms well adapted to the specific needs of clustering, categorization and retrieval of natural language data;

  • methods and tools for acquiring new domain-specific linguistic resources (lexicons, thesauri and ontologies).

 

DESCRIPTION

Machine Learning (ML) is a general paradigm that aims to estimate the parameters of an unobserved system from observed samples (also called examples). As such, ML can replace and/or supplement the traditional development of hand-coded rule-based systems. A direct consequence is that ML can be used to acquire useful lexical information, since such information is usually obtained by applying rules. ML is therefore often seen as a solution to the data acquisition bottleneck.

We focus on three main areas that lie at the heart of textual information access and in which ML plays an important role. The first is the automatic acquisition of relevant textual units and their typing according to a given set of types. The second is the automatic acquisition of relevant links between these units. The third is the general problem of text categorization and clustering.

  • An ML approach to extraction and typing requires the developer/user to provide only (some) examples of extracted and typed data, while the rule induction task is delegated to a computer program. Our ML approach combines statistical evaluation and grammatical induction. Statistical evaluation is used to filter hypotheses, while grammatical induction methods are mainly used in wrapper generation. We see the latter problem as the inference of regular/context-free languages/transducers from sample strings or translations.

  • Our approach to the problem of linking textual units is based on the assumption that interesting relations can be acquired from texts (certain patterns, for example, are characteristic of the "is_a" relation). It combines linguistic processing, which derives interesting features, with statistical induction, which infers relations from the extracted features. On the linguistic side, chunkers and parsers are adequate tools, being robust and sufficiently precise. On the learning side, we need statistical methods, such as probabilistic latent semantic models, that allow the inference of new concepts and of links between them.

  • Our current capabilities in document categorization and clustering will be extended towards Support Vector Machines (SVMs) and other kernel-based techniques. SVMs search for an optimal hyperplane that linearly separates positive examples from negative ones. This linear separation usually takes place in a high-dimensional space obtained through a mapping of the input space; the mapping is defined implicitly by a kernel function, which computes inner products in that space. Kernel methods are probably the most promising innovative approach for measuring similarities between documents, and can deal in a uniform way with the different components of a multimedia document.
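
As a rough illustration of the kernel idea in the last point above, the sketch below checks that a quadratic kernel equals an inner product in an explicit higher-dimensional feature space, and then separates XOR-style data that no line separates in the input space. It is not the software developed in this work: the learner is a kernel perceptron, swapped in for an SVM because it fits in a few lines (it finds a separating hyperplane, not the maximum-margin one), and all names (quad_kernel, feature_map, train_kernel_perceptron, predict) and the toy data are assumptions made for the example.

```python
import math


def quad_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = (x . z)^2."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def feature_map(x):
    """Explicit 3-D feature map for 2-D inputs; its dot product equals quad_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def decision_value(x, points, labels, alphas, kernel):
    """Signed score of x: a kernel-weighted vote of the training points."""
    return sum(a * y * kernel(p, x) for a, p, y in zip(alphas, points, labels))


def train_kernel_perceptron(points, labels, kernel, epochs=10):
    """Kernel perceptron (not an SVM): count mistakes per training point."""
    alphas = [0] * len(points)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(points, labels)):
            if y * decision_value(x, points, labels, alphas, kernel) <= 0:
                alphas[i] += 1  # misclassified (or undecided): strengthen this point
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: the data is separated
            break
    return alphas


def predict(x, points, labels, alphas, kernel):
    """Predicted class label, +1 or -1."""
    return 1 if decision_value(x, points, labels, alphas, kernel) > 0 else -1


# Toy XOR-style data (hypothetical): not linearly separable in the 2-D input space.
POINTS = [(1, 1), (-1, -1), (-1, 1), (1, -1)]
LABELS = [1, 1, -1, -1]

ALPHAS = train_kernel_perceptron(POINTS, LABELS, quad_kernel)
PREDICTIONS = [predict(p, POINTS, LABELS, ALPHAS, quad_kernel) for p in POINTS]
```

In the mapped space the middle coordinate, sqrt(2)·x1·x2, is positive for one class and negative for the other, which is why a linear separator exists there even though none exists in the original two dimensions.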

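The remark in the second point above, that certain patterns are characteristic of the "is_a" relation, can be made concrete with a small sketch. It assumes one hand-written surface pattern ("X such as A, B and C") and single-word terms; the approach described above instead relies on chunkers and parsers for feature extraction and on statistical induction to infer relations. The pattern, the function name extract_is_a and the example sentence are all hypothetical.

```python
import re

# Separators allowed between listed terms; longer alternatives come first.
SEPARATOR = r"(?:, and |, or |, | and | or )"

# Hypothetical surface pattern: "<general term> such as <term>, <term> and <term>".
SUCH_AS = re.compile(r"(\w+) such as ((?:\w+" + SEPARATOR + r")*\w+)")


def extract_is_a(text):
    """Return (specific, general) pairs read off the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        general = match.group(1)
        for specific in re.split(SEPARATOR, match.group(2)):
            pairs.append((specific, general))
    return pairs


SENTENCE = "The corpus covers diseases such as asthma, diabetes and malaria."
PAIRS = extract_is_a(SENTENCE)
# PAIRS == [("asthma", "diseases"), ("diabetes", "diseases"), ("malaria", "diseases")]
```

A single pattern like this is brittle: it misses multi-word terms and will also fire on sentences where "such as" does not express class membership, which is one reason for pairing linguistic processing with statistical filtering of the candidate relations.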
 

RECENT EXTERNAL PROJECTS

  • KerMIT (Kernel Methods for Image and Text classification, clustering, ranking and filtering) concerns the development of algorithms and software for the classification, clustering, ranking and filtering of digital documents, in both online and offline settings.


  • MUCHMORE (Multilingual Concept Hierarchies for Medical Information Organisation and Retrieval): a transatlantic project (NSF/EEC) involving CMU-LTI (USA), Stanford-CSLI (USA), DFKI (Germany), Zinfo (Germany), EIT (Switzerland) and XRCE (France). This project aims to develop technologies and a prototype system for multilingual information organization and access in the medical domain. XRCE will focus on terminology extraction from parallel and comparable corpora.

  • "Alliances": a project funded by the French Ministry of Research and Technologies, involving LIMSI (Orsay), LIP6 (Paris), la Fondation Charles Léopold Mayer pour le Progrès de l'Homme (Paris), and XRCE (Grenoble). This project aims to develop tools that help users navigate a collection of documents and identify agreements and disagreements between authors.

  • TransType2 (Computer Assisted Translation): an RTD project funded by the European Commission under the IST Programme.
    The aim of TT2 is to develop a Computer-Assisted Translation (CAT) system that will help solve a very pressing social problem: how to meet the growing demand for high-quality translation. The innovative solution proposed by TT2 is to embed a data-driven Machine Translation (MT) engine within an interactive translation environment.

 
