Publications
Authors:
  • Sara Stymne, Nicola Cancedda
Citation:
EMNLP, 6th Workshop on Statistical Machine Translation, Edinburgh, UK, July 30-31.
Abstract:
In many languages the use of compound
words is very productive. A common practice
to reduce sparsity consists in splitting compounds
in the training data. When this is done,
the system incurs the risk of translating components
in non-consecutive positions, or in the
wrong order. Furthermore, a post-processing
step of compound merging is required to reconstruct
compound words in the output. We
present a method for increasing the chances
that components that should be merged are
translated into contiguous positions and in the
right order. We also propose new heuristic
methods for merging components that outperform
all known methods, and a learning-based
method that has similar accuracy as the heuristic
method, is better at producing novel compounds,
and can operate with no background linguistic resources.
Year:
2011
Report number:
2011/015