Discontinuous Phrase Template Extraction and Phrase Combination in Phrase-Based Statistical Machine Translation

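【Author】 Duan Nan (段楠)

【Advisors】 He Pilian (何丕廉); Li Mu (李沐)

【Author Information】 Tianjin University, Computer Application Technology, 2007, Master's thesis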

【Abstract】 Machine translation (MT) is the use of a computer to convert text or utterances in one natural language into another natural language while preserving their meaning. Given a source-language input, MT is a decision process: finding the target-language text that best matches the source in meaning. Among the many kinds of MT systems, phrase-based statistical machine translation (phrase-based SMT) is undoubtedly the most effective approach.

Phrase-based SMT allows general many-to-many relations between source-language and target-language words. Phrases extracted from the word alignment matrices are stored in a phrase translation table. The translation model can therefore take the context of words into account, and local changes in word order between the source and target language can be learned explicitly. On the Chinese-English translation task, phrase-based SMT performs significantly better than purely word-based SMT.

However, this approach also has shortcomings. Because the length of an extracted phrase is limited, some fixed structures whose parts lie far apart in a Chinese sentence cannot be extracted as a whole unit. These structures are discontinuous in Chinese, but their translations are continuous in English. Moreover, piecing together the separate translations of a structure's parts is not equivalent to translating the structure as a whole.

This thesis adds discontinuous phrase templates and merged phrases to the phrase translation table to improve the quality of phrase-based SMT. Template extraction and phrase merging use no syntactic information; both are learned solely from a word-aligned bilingual corpus. The thesis describes the extraction and merging algorithms and reports comparative experiments on the 2002-2005 NIST (National Institute of Standards and Technology) standard test sets, using BLEU as the evaluation metric. The results show that adding phrase templates and merged phrases improves translation quality to some extent over the baseline phrase-based SMT system.
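The phrase-length limitation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, whose details the abstract does not give. It assumes the standard alignment-consistency criterion for phrase extraction (simplified: no extension over unaligned words) and, as a stand-in for template extraction, replaces an inner phrase pair with a gap symbol `X`. The sentence pair, the alignment, and the function names `extract_phrases` and `gapped_templates` are all hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]                 # inclusive (start, end) word indices
PhrasePair = Tuple[Span, Span]         # (source span, target span)

def extract_phrases(n_src: int, alignment: Set[Tuple[int, int]],
                    max_src_len: int) -> List[PhrasePair]:
    """Collect phrase pairs consistent with the word alignment.

    Simplified: target spans are the tightest span covering the aligned
    words; unaligned boundary words are not added.
    """
    pairs = []
    for s1 in range(n_src):
        for s2 in range(s1, min(s1 + max_src_len, n_src)):
            tgt = [t for (s, t) in alignment if s1 <= s <= s2]
            if not tgt:
                continue
            t1, t2 = min(tgt), max(tgt)
            # Consistent: every target word in [t1, t2] aligns only to
            # source words inside [s1, s2].
            if all(s1 <= s <= s2 for (s, t) in alignment if t1 <= t <= t2):
                pairs.append(((s1, s2), (t1, t2)))
    return pairs

def gapped_templates(src: List[str], tgt: List[str],
                     pairs: List[PhrasePair]) -> Set[Tuple[str, str]]:
    """Assumed illustration: replace an inner phrase pair with a gap X,
    keeping at least one source word on each side of the gap."""
    templates = set()
    for (s1, s2), (t1, t2) in pairs:
        for (g1, g2), (h1, h2) in pairs:
            inside_src = s1 < g1 and g2 < s2
            inside_tgt = t1 <= h1 and h2 <= t2 and (h1, h2) != (t1, t2)
            if inside_src and inside_tgt:
                src_side = src[s1:g1] + ["X"] + src[g2 + 1:s2 + 1]
                tgt_side = tgt[t1:h1] + ["X"] + tgt[h2 + 1:t2 + 1]
                templates.add((" ".join(src_side), " ".join(tgt_side)))
    return templates

# Hypothetical example: "在 ... 上" is split in Chinese but maps to "on".
src = ["在", "这", "张", "桌子", "上"]
tgt = ["on", "this", "table"]
alignment = {(0, 0), (4, 0), (1, 1), (2, 1), (3, 2)}

pairs_short = extract_phrases(len(src), alignment, max_src_len=3)
pairs_long = extract_phrases(len(src), alignment, max_src_len=5)
templates = gapped_templates(src, tgt, pairs_long)
```

With a limit of three source words, no extracted phrase contains 在 or 上, because the two words align to the same English word but lie five positions apart. Raising the limit to five recovers only the full sentence pair, from which the sketch derives the template `在 X 上` ↔ `on X`.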

  • 【Online Publication Contributor】 Tianjin University
  • 【Online Publication Issue】 2009, Issue 04