
New text mining capabilities open the door to Big Data analytics 

Health care is a very large domain that covers many disciplines, ranging from biomedical research to applied clinical care. While every discipline has its own specific requirements, they all share something in common: a reliance on text for sharing information. While new forms of communication and information sharing are emerging through the exploitation of electronic databases and IT infrastructures, text and textual ‘documents’ remain the most widely used and preferred channel for humans to exchange information. In the medical world, this text resides in scientific publications, clinical guidelines, clinical trials, patient records, clinical notes and so on. These documents may be handwritten, typed, semi-structured or completely unstructured.

Although preferred, the use of text has a number of drawbacks. First, text tends to be ambiguous: when someone says “orange” they may be referring to the colour, the fruit, the network provider or (in France) the town. Secondly, compared with databases, where each piece of information is carefully defined, formatted and structured, text is far less constrained, which is why it is described as ‘unstructured’. Even after organization into chapters, sections and tables, text remains far less structured than a database. This may not be a problem in hard copy format, such as a book, but it quickly becomes an issue when we want to search and use the knowledge that lies in such free text. Going beyond simple keyword-based document retrieval requires powerful linguistic tools. And dealing with medical textual information goes beyond simply being able to process the text: tools must also integrate, and be compliant with, the wide range of medical standards and terminologies used in the profession, such as ICD-10 (the International Statistical Classification of Diseases and Related Health Problems) or SNOMED (a multilingual clinical health care terminology).
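Resolving this kind of lexical ambiguity is one of the jobs of such linguistic tools. A minimal sketch of context-based disambiguation is shown below; the senses and cue words are invented for illustration, whereas real systems rely on trained models and medical terminologies:

```python
# Hypothetical cue words for each sense of "orange"; illustrative only.
SENSES = {
    "colour": {"paint", "bright", "shade", "wall"},
    "fruit": {"juice", "peel", "eat", "ripe"},
    "provider": {"network", "mobile", "subscription", "sim"},
}

def disambiguate(word, context_words):
    """Pick the sense whose cue words overlap most with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, 0
    for sense, cues in SENSES.items():
        overlap = len(cues & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("orange", ["I", "drank", "fresh", "orange", "juice"]))
# → fruit
```

In practice the context window, the sense inventory and the cue weights would all come from curated terminologies and statistical models rather than a hand-written table.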

As a result, the development of free-text processing technologies has accelerated over the last ten years, producing robust tools that are now being applied in complex and challenging fields such as medicine.

Identifying when a paradigm shift occurs

In the field of medical research, the ability to read and digest the huge amount of scientific publications that cover recent theories and experiments is a daily challenge. Experts must keep abreast of trends, state-of-the-art methodologies and new discoveries to master the big picture of current knowledge and thinking. In genomics, progress in DNA sequencing at the beginning of the 2000s created a surge of activity in research and publications related to gene and protein interactions. The number of published results was so large that it was very difficult to capture the exhaustive list needed to build a comprehensive map of interactions. It required advanced text analytics to automatically analyze the content of scientific papers, detect references to gene–protein interactions and automatically create such a map.
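In a much-simplified form, this kind of extraction can be sketched as pattern matching over sentences. The sketch below is illustrative only: the interaction verbs are a made-up list, and real systems use full syntactic parsing and curated gene and protein lexicons rather than a regular expression:

```python
import re

# Toy pattern: an upper-case gene-like token, an interaction verb,
# and a second gene-like token. Purely illustrative.
INTERACTION = re.compile(
    r"\b(?P<a>[A-Z][A-Z0-9]{2,})\s+"
    r"(?P<verb>activates|inhibits|binds|phosphorylates)\s+"
    r"(?P<b>[A-Z][A-Z0-9]{2,})\b"
)

def extract_interactions(text):
    """Return (entity, verb, entity) triples found in the text."""
    return [(m["a"], m["verb"], m["b"]) for m in INTERACTION.finditer(text)]

sentence = "Our results show that BRCA1 binds RAD51 in vivo."
print(extract_interactions(sentence))
# [('BRCA1', 'binds', 'RAD51')]
```

Each extracted triple would then be normalized against a gene database before being added to the interaction map.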

Similarly, and more recently, new research trends have focused on phenomics, which aims to discover the links between genes and certain types of diseases. Discovering these links requires smart information extraction tools. For example, studies have demonstrated that a single gene is at the origin of certain types of diabetes which sometimes evolve into cancer. Such a link was found thanks to large epidemiological studies, which focus on patient records over a long period of time to discover hidden correlations. These studies, which would once have required an army of skilled experts to manually annotate relevant pieces of information, can now be conducted with smart information extraction tools that detect, collect and structure this information in an exhaustive and automatic way.

Being immediately informed of new discoveries, or of major changes that impact knowledge previously taken for granted, is particularly important and very challenging in the medical field due to the amount of information already available. In phenomics, the interactions of genes and proteins described in databases are used in the creation of new medication and treatment. It is crucial that these databases be exhaustive, but keeping them up to date is extremely labour intensive: each new publication must be read and compared with existing knowledge to check its relation and possible impact on earlier data. Information extraction technology can be used to detect the paradigm shifts expressed in the language used within the texts. It pinpoints the subtle signs in the syntax or semantics used to describe results that indicate such changes with respect to previous knowledge, such as “… a new experiment has demonstrated that molecule X is no longer …”, making it easier for experts to identify new results.
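A toy version of this cue detection might look like the following. The cue phrases are invented examples, not the actual rules of any production system; real rules would be developed with domain experts over large corpora:

```python
import re

# Illustrative cue phrases signalling a revision of prior knowledge.
REVISION_CUES = [
    r"\bno longer\b",
    r"\bcontrary to (previous|earlier) (findings|reports|belief)\b",
    r"\bcould not be (reproduced|confirmed)\b",
]
CUE_RE = re.compile("|".join(REVISION_CUES), re.IGNORECASE)

def flag_revisions(sentences):
    """Return the sentences that contain a knowledge-revision cue."""
    return [s for s in sentences if CUE_RE.search(s)]

docs = [
    "A new experiment has demonstrated that molecule X is no longer "
    "considered a reliable marker.",
    "Molecule Y behaves as expected under standard conditions.",
]
print(flag_revisions(docs))
# only the first sentence is flagged
```

Flagged sentences would be routed to database curators for review rather than applied automatically.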

Clinical guidelines, which provide medical best practices and advice on treating specific diseases, can also be kept current with natural language processing, which automatically mines the clinical literature and proposes changes to the experts who manage them. This helps bridge the gap between medical research and medical care. The same technology can facilitate the enrollment of patients in trials and support epidemiological studies. In trials, it can analyze and formalize the patient eligibility criteria for each trial, then compare this analysis with patient information to identify the individuals who match the criteria or, vice versa, identify a suitable trial for a given patient.
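Once eligibility criteria have been formalized, the matching step itself is straightforward. The sketch below assumes the NLP step has already turned free-text criteria into a structured form; the field names are hypothetical, and the ICD-10 codes (E11 for type 2 diabetes, C25 for pancreatic cancer, I10 for hypertension) are used purely for illustration:

```python
# Minimal matcher over already-structured criteria; real criteria are
# extracted by NLP from free-text trial protocols.
def matches(patient, criteria):
    """Check a patient record against a trial's structured criteria."""
    lo, hi = criteria["age_range"]
    if not (lo <= patient["age"] <= hi):
        return False
    # All required diagnoses must be present...
    if not criteria["required_diagnoses"] <= set(patient["diagnoses"]):
        return False
    # ...and no excluded diagnosis may appear.
    if set(patient["diagnoses"]) & criteria["excluded_diagnoses"]:
        return False
    return True

trial = {
    "age_range": (40, 75),
    "required_diagnoses": {"E11"},   # ICD-10: type 2 diabetes
    "excluded_diagnoses": {"C25"},   # ICD-10: pancreatic cancer
}
patient = {"age": 58, "diagnoses": ["E11", "I10"]}
print(matches(patient, trial))  # True
```

Running the same check in the other direction, over all open trials for one patient, yields the "suitable trial for a given patient" view described above.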

Analysing risk

Another important application of text analysis that benefits clinical care is risk assessment. Hospitals are very complex environments with multiple practices, areas of expertise, processes and treatments. A shortage of resources, complex processes, emergencies and stress are all factors that, combined, may occasionally lead to human errors that affect patient safety. A number of monitoring processes are used to prevent or detect these errors as early as possible, but the diversity of medical information channels and information systems used in hospitals makes this very difficult. As text is one of the most common channels medical staff use to describe the complex situations they face, health security experts need tools that can monitor the wide spectrum of data floating around inside hospitals.

Hospital-acquired infections are a good illustration of this complexity. One of the metrics used to measure the level of risk of these infections is the amount of soap used per member of medical staff per surgery, which can be cross-checked against actual reports of infections. To better understand the correlation, a French government-funded research project was conducted to detect hospital-acquired infections from patient records across a group of hospitals. The objective was to identify, from patient discharge summaries, any piece of evidence or event that could lead the system to flag a risk of infection. Researchers worked with the medical staff to translate their medical knowledge into natural language processing rules that could sift through the records and alert staff when an infection was suspected. The system successfully detected 87 percent of infections, demonstrating its value in helping qualified experts monitor and improve patient care.
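The kind of rules described above can be loosely sketched as trigger terms combined with a simple negation check. The trigger terms and the negation list below are invented for illustration; the project's actual rules encoded far richer expert knowledge:

```python
import re

# Illustrative trigger terms for suspected hospital-acquired infection.
TRIGGERS = re.compile(
    r"\b(post-?operative (fever|infection)|wound (discharge|redness)|"
    r"positive blood culture|sepsis)\b",
    re.IGNORECASE,
)
# Naive negation check; real systems use proper negation-scope detection.
NEGATION = re.compile(r"\b(no|without|denies|ruled out)\b", re.IGNORECASE)

def suspected_infection(summary):
    """Flag a summary if a trigger appears in a non-negated sentence."""
    for sentence in re.split(r"(?<=[.!?])\s+", summary):
        if TRIGGERS.search(sentence) and not NEGATION.search(sentence):
            return True
    return False

note = "Day 3: post-operative fever and wound redness observed."
print(suspected_infection(note))  # True
```

An alert raised by such a rule would go to the infection-control team for confirmation, which is consistent with the system's role of accompanying, not replacing, qualified experts.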

About the authors

Denys Proux is project manager of services and healthcare innovation at Xerox Research Centre Europe. He focuses on applying natural language processing and bioinformatics expertise to create advanced text mining tools for healthcare.

Caroline Hagège is a senior scientist at Xerox Research Centre Europe. She is an expert in parsing and semantics, and in the last few years has been researching applications of this technology, combined with medical terminology, to analyse electronic medical records and patient data.