LARGE SCALE DATA MINING

OVERVIEW:

XRCE's Large Scale Data Mining research area is the point of reference for new data mining algorithms within Xerox. We focus on learning and inference from high-dimensional data that is heterogeneous and evolves over time. Our algorithms have been applied to mining logs from networks and from devices such as printers, enabling behavior prediction, optimization, visualization and diagnostics. We also produce text categorization and clustering tools that have been applied in numerous business settings, ranging from the filtering and routing of mail to the analysis of Voice of Customer and survey data. Scientifically, our text analysis tools are unique in that a single model unifies categorization and clustering.


ACTIVITIES:

Print Infrastructure Mining

We create software that manages infrastructures involving hundreds of printers. By applying data mining tools to print data, we help administrators identify usage patterns, detect abnormal behavior and optimize the placement and settings of the printers.
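As a rough illustration of what detecting abnormal behavior in print data can look like, the sketch below flags printers whose daily page count deviates strongly from the rest of the fleet, using a modified z-score based on the median. This is a generic outlier test chosen for illustration, not the method used in XRCE's software; the function name, the threshold and the sample data are all assumptions.

```python
from statistics import median


def flag_abnormal_printers(daily_pages, threshold=3.5):
    """Return the printers whose page count is an outlier in the fleet.

    daily_pages maps a printer name to its page count for one day.
    The modified z-score (median / MAD) is used because a single
    extreme printer barely shifts the median, unlike the mean.
    """
    volumes = list(daily_pages.values())
    med = median(volumes)
    # Median absolute deviation: a robust estimate of spread.
    mad = median(abs(v - med) for v in volumes)
    if mad == 0:
        return []  # no spread at all, so nothing can be called abnormal
    flagged = []
    for name, pages in daily_pages.items():
        score = 0.6745 * (pages - med) / mad
        if abs(score) > threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical page counts for five printers on one day.
usage = {"p1": 120, "p2": 135, "p3": 110, "p4": 128, "p5": 900}
print(flag_abnormal_printers(usage))
```

A real deployment would also need to account for time of day, printer model and user population; the sketch only shows the basic idea of comparing one device against the fleet.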

Device Log Mining

Like aircraft, high-end printers are exceptionally complex to monitor and maintain. We tackle this challenge with dynamic Bayesian networks applied to huge volumes of sensor data from our worldwide printer fleet.
We provide tools for prognostics, diagnostics, troubleshooting, preventive maintenance and fleet health visualization.

Text Categorization and Clustering

We design text analysis tools that address several practical business issues: hierarchical structure (tree-structured taxonomies and multi-level language models); dynamic collections (emergence of new topics, vocabulary drift); mixed textual and quantitative data, as in survey analysis; noise, fuzziness and uncertainty in documents (such as those produced by OCR, translation or speech-to-text processes); and reducing the annotation burden by optimally combining active and semi-supervised learning.
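To make the idea of reducing the annotation burden more concrete, the sketch below shows uncertainty sampling, one common active learning strategy: among unlabeled documents, the ones whose predicted category distribution has the highest entropy are sent to a human annotator first. It covers only the active learning side, is not XRCE's algorithm, and every name and probability in it is a hypothetical example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a category probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, k=1):
    """Return the k document ids the classifier is least sure about.

    predictions maps a document id to the classifier's predicted
    probabilities over the categories for that document.
    """
    ranked = sorted(
        predictions,
        key=lambda doc_id: entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical classifier outputs for three unlabeled documents.
predictions = {
    "doc_a": [0.95, 0.03, 0.02],  # confident prediction
    "doc_b": [0.40, 0.35, 0.25],  # nearly undecided
    "doc_c": [0.70, 0.20, 0.10],
}
print(most_uncertain(predictions, k=2))
```

In a semi-supervised setting, the confidently classified documents (such as `doc_a` here) could instead be used with their predicted labels, so that annotation effort goes only where the model is unsure.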

Hybrid Text-Image Information Access

Multimedia information access (categorizing or clustering multimedia documents, querying a multimedia database) requires algorithms that bridge the gap between different media by providing "translational" links and exploiting cross-media information. These algorithms are applied to multimedia tasks such as automatic image annotation, automatic text illustration, cross-media categorization and searching an image database with text queries.
