Combining NLP and probabilistic categorisation for document and term selection for SWISS-PROT Medical annotation
Pavel Dobrokhotov, Cyril Goutte, Anne-Lise Veuthey, Eric Gaussier
Motivation: Looking for relevant publications for manual database annotation is a tedious task. In this paper, we
show that the combination of natural language processing (NLP) and classification tools can help re-rank
the documents returned by PubMed according to their relevance to SWISS-PROT annotation.
Results: With a probabilistic latent categoriser (PLC), we obtained 69% recall and 59% precision for relevant
documents in a representative query. As the PLC technique provides the relative contribution of each term to
the final document score, we used the Kullback-Leibler symmetric divergence to determine the most
discriminating words for SWISS-PROT medical annotation. This information should allow curators to better
understand classification results and is also of great value for fine-tuning the linguistic pre-processing of
documents, which in turn can improve the overall classifier performance.
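
As an illustration of how a symmetric Kullback-Leibler divergence can rank discriminating terms, the following is a minimal sketch and not the authors' implementation. It assumes two Laplace-smoothed unigram distributions estimated from toy tokenised documents; the function names, the smoothing parameter and the example tokens are all hypothetical, and the PLC model in the paper may estimate its term probabilities differently. Each term contributes (p - q) * log(p / q), and these contributions sum to KL(p||q) + KL(q||p).

```python
import math
from collections import Counter


def term_distribution(docs, vocab, alpha=1.0):
    """Laplace-smoothed P(term | class) from tokenised documents (assumed estimator)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def symmetric_kl_contributions(p, q):
    """Per-term contribution (p - q) * log(p / q); the sum is KL(p||q) + KL(q||p)."""
    return {t: (p[t] - q[t]) * math.log(p[t] / q[t]) for t in p}


# Toy data, invented for illustration only.
relevant = [["mutation", "disease", "patient"], ["mutation", "variant", "disease"]]
irrelevant = [["structure", "crystal", "binding"], ["binding", "kinetics", "structure"]]

vocab = sorted({tok for doc in relevant + irrelevant for tok in doc})
p = term_distribution(relevant, vocab)
q = term_distribution(irrelevant, vocab)

contrib = symmetric_kl_contributions(p, q)
# Terms with the largest contribution separate the two classes most strongly.
ranked = sorted(contrib, key=contrib.get, reverse=True)
```

Every contribution is non-negative, so a term scores highly whether it is typical of relevant or of irrelevant documents; the sign of p - q indicates which class it favours.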
Proceedings of the 11th International Conference on Intelligent Systems for Molecular Biology (ISMB 2003)