Implementation and Analysis of Tree to String Alignment Template Model in Statistical Machine Translation

【Author】 Zhang Chunyue (张春越)

【Advisor】 Zhao Tiejun (赵铁军)

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2010, Master's thesis

【Abstract】 Statistical machine translation uses statistical methods to automatically translate text from one natural language into another. Recently, researchers in statistical machine translation have turned their attention to translation models that incorporate linguistic information. Among these models, the tree-to-string alignment template model is a good representative.

First, this thesis gives a fairly comprehensive account of the syntax-directed tree-to-string alignment template model, covering its formal definition, parameter estimation, and decoding method, and implements a decoder based on the model. To speed up decoding, cube pruning is used to prune hypotheses, which significantly reduces decoding time.

Second, the tree-to-string alignment template model is analyzed empirically and compared in detail with the phrase-based model on three points. (1) The tree-to-string alignment template model has greater generative capacity and can express the non-contiguous collocations that are common in language. (2) It handles long-distance reordering better than the phrase-based model. (3) Despite these advantages, it cannot express contiguous phrases that are not syntactic constituents. The decoder is then evaluated on the NIST 2005 and NIST 2008 MT test sets, with Moses as the baseline system.

Finally, statistical transliteration of foreign person names from Chinese to English is explored. First, common statistical transliteration methods are classified, and two transliteration models are described in detail: one based on sequence labeling and one based on the noisy channel model. Second, extensive experiments lead to the following conclusions: for the noisy channel-based transliteration model, the Chinese side should use the Chinese character as its basic unit, and syllabifying English person names yields better performance with low-order language models. Third, reranking can greatly improve the model's performance.
