Litigation e-discovery

In litigation and government investigations, massive volumes of data are collected during the electronic discovery ('eDiscovery') process. Litigants must identify the documents relevant to the matter at hand and produce them either to opposing counsel or in response to a government subpoena.

The biggest cost in eDiscovery is typically review: teams of lawyers review documents, selecting them for 'production' or flagging them for 'privilege', i.e. documents that may be subject to attorney-client privilege or attorney work product protection.

Using keyword searches and their variants to identify potentially responsive data has become standard procedure in document reviews. However, keyword search is neither a sufficient nor a defensible mechanism for retrieving an acceptably complete set of potentially responsive documents, partly because it leaves the burden on the legal teams to conceive, a priori, all possible search terms that might retrieve responsive material.

Consequently, and given the ever-expanding volumes of data subject to eDiscovery, more advanced search technologies have been put forward (e.g. 'concept search'), and courts have allowed the use of Technology Assisted Review (TAR), also known as 'Predictive Coding'.

Machine learning text classification is one such technology: classifier models are built by having statistical algorithms 'learn' from a set of pre-classified (labeled) training documents provided by a subject matter expert. The resulting statistical models can then generate scores that predict the likely categories of new (unlabeled) documents drawn from the whole review population.
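The learn-then-score workflow described above can be sketched as follows. This is a minimal illustration, not the actual system discussed here: the library (scikit-learn), the feature representation (TF-IDF), the classifier (logistic regression), and all document texts and labels are assumptions chosen for clarity.

```python
# Sketch of the described workflow: train on expert-labeled documents,
# then score unlabeled documents from the review population.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Pre-classified training documents provided by a subject matter expert
# (toy examples; real reviews train on thousands of labeled documents).
train_docs = [
    "merger agreement draft attached for review",
    "quarterly earnings report and revenue figures",
    "lunch plans for friday with the team",
    "schedule for the office holiday party",
]
train_labels = ["responsive", "responsive", "non-responsive", "non-responsive"]

# Build the statistical model: TF-IDF features feeding a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# Score new (unlabeled) documents; the probabilities can also be used
# to rank the whole review population for attorneys.
new_docs = ["revised merger agreement terms", "friday lunch reservation"]
scores = model.predict_proba(new_docs)   # one probability per category
predictions = model.predict(new_docs)
for doc, pred in zip(new_docs, predictions):
    print(f"{pred}: {doc}")
```

In practice the predicted probabilities, rather than the hard labels, are what drive the review: documents are ranked by score so that reviewers see the most likely responsive material first.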

Our research activities in the field of litigation consist of developing machine learning techniques to support document reviews. We do not aim to automate the review: attorneys, who are most knowledgeable about the case, still drive it, but they are guided by the classifier outputs and document rankings. The main benefits of these statistical methods are improved accuracy and consistency of designations, reduced review costs, faster reviews, and enhanced defensibility.

Our research addresses the challenges of Big Data in litigation. We design large-scale algorithms for runtime classification, training, clustering, similarity search, and Active Learning. All of these algorithms must scale to the millions of documents that constitute a single case.
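To make the Active Learning component concrete, here is a minimal uncertainty-sampling sketch: the current classifier scores the unlabeled pool and the document it is least sure about is queried for expert labeling. This is an illustrative assumption, not the algorithms developed here; the scikit-learn classes, the toy documents, and the 0/1 labels are all invented for the example.

```python
# Sketch of uncertainty-sampling Active Learning: ask the expert to
# label the document whose predicted probability is closest to 0.5.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["contract breach notice", "team picnic signup"]
labels = [1, 0]  # 1 = responsive, 0 = non-responsive (toy labels)
unlabeled = ["draft settlement contract", "picnic food list", "invoice dispute letter"]

# Fit the vectorizer on the full corpus so pool documents share the vocabulary.
vec = TfidfVectorizer().fit(labeled + unlabeled)
clf = LogisticRegression().fit(vec.transform(labeled), labels)

# Score the unlabeled pool; uncertainty = closeness of p(responsive) to 0.5.
proba = clf.predict_proba(vec.transform(unlabeled))[:, 1]
uncertainty = np.abs(proba - 0.5)
query_idx = int(np.argmin(uncertainty))  # most uncertain document
print("next document to label:", unlabeled[query_idx])
```

The selected document is sent to the expert, its new label is added to the training set, and the loop repeats, which is why the training, classification, and selection steps all need large-scale implementations when the pool holds millions of documents.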


Read the case study of how CategoriX from Xerox Research Centre Europe reduced the number of documents that had to be manually reviewed by 86%, from 30 million to 4.1 million.