基于双语语料库的机器翻译关键技术研究

Research on Bilingual Corpus-Based Machine Translation

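【Author】 Chao Wenhan (巢文涵)

【Advisor】 Li Zhoujun (李舟军)

【Author information】 National University of Defense Technology, Computer Science and Technology, 2008, PhD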

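【Abstract (Chinese version, translated)】 Machine translation has been studied for a long time, yet it has not fully reached the goal people expect of it. With the rapid development of computer hardware and software and the maturing of corpus construction, machine translation that exploits statistical knowledge has become feasible, and translation quality can be expected to move a step closer to human expectations. Since the noisy channel model, and especially the maximum entropy model, was proposed, a central task in machine translation has been how to integrate more effective knowledge (linguistic knowledge in particular) into the model so as to further improve translation quality. This thesis focuses on machine translation between Chinese and English. It presents a systematic, in-depth study of how to incorporate syntactic knowledge effectively into machine translation based on a Chinese-English bilingual corpus, and builds a complete system. Specifically, the thesis covers the following work:

1. A word alignment model and method based on syntactic knowledge. Word alignment is the foundation of statistical machine translation, and its quality ultimately affects translation quality. To address the difficulty of word alignment between Chinese and English, the thesis proposes an improved word alignment model that introduces syntactic knowledge into the alignment process in order to account for the complex word-order changes in Chinese-English word alignment. It first converts the constraints implicit in inversion transduction grammar (ITG) into explicit position judgments, so that the ITG model can be introduced effectively into a log-linear word alignment model. It also designs a similarity measure between the syntactic parse tree and the ITG tree, which brings the constraints of the parse tree into the ITG-based word alignment model. Combining these two kinds of syntactic knowledge allows word-order changes in word alignment to be constrained more tightly.

2. A tree-to-tree mapping model and method for statistical machine translation. Because source and target sentences differ in word order, reordering, which handles the changes in target word order during translation, is one of the hard problems that statistical machine translation (SMT) must face. The thesis proposes a tree-to-tree mapping SMT model that constrains the order changes of target phrases at the global level by mapping between the syntactic tree of the source sentence and the ITG tree. The model also contains an ITG-based local reordering feature, which decomposes the orientation prediction for two blocks into orientation predictions for their adjacent sub-blocks and can therefore predict the translation orientation between two blocks of arbitrary length. Integrating the local and global models effectively accounts for the complex relationship between source and target sentences.

3. A similar-example retrieval method based on bilingual information. Example-based machine translation (EBMT) translates by analogy and, given similar examples, can produce fluent translations. How to retrieve similar examples from a large example base is therefore important for the quality of EBMT. The thesis proposes a novel retrieval method that uses the word alignment information in the examples to define a series of similarity measures between the input sentence to be translated and the examples in the training corpus, which improves retrieval quality. To speed up retrieval, it also designs a two-level inverted index table, which improves retrieval efficiency.

4. An example-based statistical machine translation model and method. The tree-to-tree model above starts from the source sentence and tries to ensure that the structure of the generated translation satisfies the constraints of the source sentence's syntactic tree. It therefore cannot guarantee that the structure of the target sentence is sound. The thesis proposes a hybrid model, an extension of the tree-to-tree model, that incorporates example knowledge into SMT to ensure the structural soundness and fluency of the translation. It also presents an example-based decoder that combines statistical knowledge with example information to improve the quality and efficiency of decoding.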

【Abstract】 Machine translation has been studied for a long time, but its quality has not yet reached the level people expect. With the rapid development of computer technology and the improvement of corpus construction, machine translation based on statistical knowledge has become possible, and translation quality has a chance to move closer to human expectations. Since the noisy channel model, and especially the maximum entropy model, was proposed for machine translation, one of the central tasks has been to integrate more useful knowledge, especially linguistic knowledge, to improve translation quality further. This thesis focuses on machine translation between Chinese and English. We carry out an in-depth, systematic study of how to incorporate syntactic knowledge into bilingual corpus-based machine translation, and we implement a complete system. In detail, the thesis covers the following topics:

1. We propose a syntax-based word alignment model. Word alignment is the basis of statistical machine translation, and its quality strongly affects the quality of translation. Considering the problems faced in Chinese-English word alignment, we propose an improved word alignment model that introduces syntactic knowledge to explain the flexible word order within the word alignment. By transforming the constraints that are implicit in the inversion transduction grammar (ITG) into explicit position judgments, we introduce the ITG into the log-linear word alignment model in an effective way. In addition, by designing similarity metrics between the syntactic tree and the ITG tree, we integrate the syntactic knowledge into the ITG-based word alignment model, so that the model can constrain the complex word order within the word alignment.

2. We propose a tree-to-tree statistical machine translation model. Because word order differs between the source sentence and the target sentence, one of the problems that statistical machine translation (SMT) must solve is the reordering of the target words. We present a tree-to-tree SMT model. By mapping between the syntactic tree and the ITG tree, the model limits the reordering of phrases in the global scope. In the local scope, the tree-to-tree model takes an ITG-based local reordering model as one feature, in which the reordering probability of two blocks is decomposed into the product of the reordering probabilities of their child blocks. The model is therefore able to estimate the reordering of two blocks of arbitrary length. By combining the global and local reordering models, the tree-to-tree model is able to explain the complex relationship between the source and target sentences.

3. We propose a similar-example retrieval approach based on bilingual information. When given similar translation examples, an example-based machine translation (EBMT) system can generate fluent translations. It is therefore very important for EBMT to retrieve similar examples from a large-scale corpus. We propose a novel retrieval approach that makes good use of the word alignment knowledge within the examples. To measure the similarity between the input sentence to be translated and a translation example, we design a series of similarity metrics based on the word alignment within the example. These metrics improve the quality of retrieval. We also design a two-level inverted index table to improve the efficiency of retrieval.

4. We propose an example-based statistical machine translation model. The tree-to-tree SMT model above considers the source sentence only and tries to make the translation satisfy the constraints of the source sentence's syntactic tree. It is therefore unable to ensure that the structure of the target sentence is reasonable. We present a hybrid machine translation model that extends the tree-to-tree model by incorporating example knowledge into SMT, to ensure the translation's fluency and consistency. At the same time, we present an example-based decoder that makes use of both the knowledge within the translation examples and the statistical knowledge to improve the quality of translation.
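The abstract does not spell out how the implicit ITG constraints become explicit position judgments. As a rough illustration only, the sketch below uses a standard property of ITG: a reordering can be generated by a binary ITG exactly when the target positions can be merged, pair by pair, into contiguous blocks in either straight or inverted order. The function name `is_itg_compatible` and the encoding of an alignment as a permutation of target positions are assumptions made for this example and are not taken from the thesis.

```python
from typing import List, Tuple


def is_itg_compatible(perm: List[int]) -> bool:
    """Check whether a reordering can be produced by a binary ITG.

    `perm[i]` is the (hypothetical) target position of the i-th source word,
    and `perm` is assumed to be a permutation of 0..n-1.
    """
    # Each stack entry is a (low, high) span of target positions that has
    # already been recognised as a single ITG block.
    stack: List[Tuple[int, int]] = []
    for pos in perm:
        stack.append((pos, pos))
        # Merge the top two blocks while their target spans are contiguous.
        while len(stack) >= 2:
            lo1, hi1 = stack[-2]
            lo2, hi2 = stack[-1]
            straight = hi1 + 1 == lo2   # left block precedes right block
            inverted = hi2 + 1 == lo1   # right block precedes left block
            if straight or inverted:
                stack[-2:] = [(min(lo1, lo2), max(hi1, hi2))]
            else:
                break
    # The whole sentence must collapse into at most one block.
    return len(stack) <= 1


if __name__ == "__main__":
    print(is_itg_compatible([1, 0, 3, 2]))  # two local swaps
    print(is_itg_compatible([1, 3, 0, 2]))  # the "inside-out" pattern 2-4-1-3
```

In this sketch the first call succeeds because each swap stays inside a contiguous block, while the second fails because no two neighbouring source words map to neighbouring target positions. A word alignment model could use a check of this kind as a hard filter or as a feature; which of these the thesis does is not stated in the abstract.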
