Comparison of Data Selection Techniques for the Translation of Video Lectures

Joern Wuebker, Hermann Ney, Adrià Martinez-Villaronga, Adrià Giménez, Alfons Juan, Christophe Servan, Marc Dymetman, Shachar Mirkin
For the task of translating scientific video lectures from English into French, we perform a qualitative and quantitative comparison of several data selection techniques based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. We identify the number of out-of-vocabulary words as another important criterion for measuring translation quality in our application; here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques to benefit from the strengths of both.
AMTA, Vancouver, Canada, October 22-26, 2014.
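
The abstract names cross-entropy-based selection without spelling out the scoring. As a rough illustration only, the sketch below ranks sentences from a general pool by the difference between their cross-entropy under an in-domain language model and under a general-domain one, a common form of this criterion. It is not the paper's implementation: the add-one smoothed unigram models, the function names, and the toy data handling are all assumptions made for brevity, and the translation model component mentioned in the abstract is not covered.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace tokens; returns (counts, total_token_count)."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy in bits under an add-one smoothed unigram model."""
    counts, total = model
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    log_prob = sum(
        math.log2((counts[tok] + 1) / (total + vocab_size)) for tok in tokens
    )
    return -log_prob / len(tokens)


def select_by_cross_entropy_difference(pool, in_domain, general, k):
    """Return the k pool sentences with the lowest H_in(s) - H_gen(s).

    Hypothetical helper: a low score means the sentence looks more like the
    in-domain data than like the general data.
    """
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general)
    # Shared vocabulary size (+1 for unseen tokens) keeps the two models comparable.
    vocab_size = len(set(in_model[0]) | set(gen_model[0])) + 1

    def score(sentence):
        return (cross_entropy(sentence, in_model, vocab_size)
                - cross_entropy(sentence, gen_model, vocab_size))

    return sorted(pool, key=score)[:k]
```

In practice the unigram models would be replaced by properly smoothed higher-order n-gram models, and the number of selected sentences k would be tuned on held-out in-domain data.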

