Document Content Laboratory

Manager: Michel Gastaldo

From a general point of view, a “document” can be considered a container of information. The information in documents exchanged in business processes, and in documents openly available as web pages, is largely unstructured and is captured in a variety of media: text, images, audio, and video. Moreover, the majority of content now published on the Internet is in languages other than English, and multilingual business communication has become ubiquitous with increasing globalization.

The mission of the Document Content Lab is to organize and make sense of this abundance of information by creating Smarter Document Management℠ technologies and making them available as services to Xerox’s customers, in support of business process automation, information retrieval and categorization, and situation assessment and decision making.

The lab is organized into three areas. Parsing and Semantics automatically makes sense of electronic documents by analyzing them semantically. Machine Learning for Document Access and Translation automatically categorizes and clusters text, and breaks language barriers through machine translation. Textual and Visual Pattern Analysis creates technology that makes everyday interaction with visual content simple and effective.

The areas have successfully transferred multiple technologies to Xerox business groups, including text and image categorization, event extraction, language identification, online translation, and information retrieval, to name a few. The researchers in the lab have varied backgrounds and experience, spanning statistical theory, natural language processing, machine learning, computational artificial intelligence, knowledge representation, image processing and computer vision, visual aesthetics, and optimization techniques. Our researchers regularly publish in world-class scientific journals and conferences, and have won several awards in parsing and classification competitions.