
统计机器翻译判别式训练方法研究

Research on Discriminative Training Methods for Statistical Machine Translation

【Author】 刘乐茂 (Lemao Liu)

【Supervisor】 赵铁军 (Tiejun Zhao)

【Author Information】 Harbin Institute of Technology, Computer Science and Technology, 2013, PhD

【Abstract】 Over the past two decades, statistical machine translation (SMT) has achieved great success, but it still falls far short of users' needs and requires further development and improvement. At present, from the perspective of mathematical models, one development trend in SMT is the transition from few features and small models to many features and large models, and the evolution from linear to nonlinear models. Following this trend, this thesis starts from the log-linear translation model, currently the mainstream model, takes discriminative training as its main thread, and studies the following four topics.

(1) For log-linear models with a small number of features, MERT, the most successful existing discriminative learning algorithm, suffers from instability. Because the k-best translation list changes at every optimization step, the optimization objective defined over that list changes as well, which causes the optimized weights to "oscillate" and makes MERT unstable. When designing the discriminative training objective, this thesis adopts the idea of ultraconservative updates to suppress this oscillation and proposes a minimum expected error rate training method based on ultraconservative updates. The method is implemented with a projected-gradient learning algorithm and is therefore simpler to implement than MERT. Experiments show that it outperforms MERT.

(2) For translation models with large-scale sparse features, existing scalable training methods can just barely handle such models in terms of training efficiency, but they suffer from severe feature sparsity and therefore deliver poor translation quality. Against feature sparsity, this thesis studies two practical countermeasures, enlarging the development set and L1 regularization, but for other reasons these two techniques are not sufficient to solve the problem. The thesis therefore proposes a training method with OSCAR-based automatic feature grouping, together with an online learning method for learning the feature-group structure efficiently. Experimental results show that this training method outperforms existing methods.

(3) All existing training methods for the log-linear model have two shortcomings. First, their performance depends heavily on the choice of development set, and a development set suited to the test task is often hard to obtain, so training on an unsuitable development set easily leads to poor test performance. Second, these methods train a single weight vector for a given development set, and that single vector cannot guarantee consistent translation results across all test sentences. To solve these two problems, this thesis proposes a local training method that, in clear contrast to existing methods, trains one weight vector per test sentence. A bottleneck of local training is its efficiency, and the thesis proposes an incremental training method to overcome it. Notably, viewed through its decision function at test time, local training corresponds to a nonlinear translation model.

(4) In modeling translation phenomena, the log-linear translation model has two limitations: it strictly requires a linear relationship between the features and the model function, which easily leads to inadequate modeling, and it cannot further abstract or interpret its surface features. Modeling translation with neural networks is a potential way to alleviate both problems: on the one hand, neural networks break through the linearity restriction and can approximate arbitrary model functions, so the modeling is more adequate; on the other hand, by introducing hidden units they can abstract and interpret the input surface features. However, when modeling and decoding are considered jointly, classical neural networks run into severe decoding-efficiency problems because of some of their characteristics. To solve this, the thesis proposes a variant neural network, the Additive Neural Network, to model translation, and proposes an effective training method for the resulting translation model.

【Abstract】 Over the last two decades, statistical machine translation (SMT) has achieved great successes; nevertheless, it is still far from meeting human needs and thus requires further development and improvement. In the current situation, from the view of mathematical models, one of the potential directions for SMT is the transition from a few features and small models to many features and large models, and the transformation from linear models to nonlinear models. Under this research direction, this thesis starts from the log-linear translation model, the most popular model for SMT, and mainly investigates the following topics, focusing on discriminative training.

(1) For the log-linear model consisting of a few features, the most successful tuning method, MERT, suffers from instability. Since the k-best translation list changes at each optimization step, the optimization objective defined over the k-best list changes as well; this causes the optimized weights to oscillate and induces the instability of MERT. This thesis employs the idea of ultraconservative updates when designing the optimization objective and proposes a new tuning method called error rate minimization based on ultraconservative update. Experiments show that its performance is better than that of MERT.

(2) For the log-linear model consisting of a large scale of sparse features, although existing tuning methods can be used to tune such a translation model as far as tuning efficiency is concerned, their performance is limited due to feature sparsity. This thesis considers two practical techniques, i.e., enlarging the tuning set and L1 regularization, and shows that these two techniques are not sufficient owing to other reasons. Therefore, it proposes a novel tuning method based on automatic feature grouping to relieve feature sparsity. In order to learn the feature-group structure efficiently, it also investigates an online learning method. Experiments show that this tuning method outperforms existing tuning methods.

(3) Existing tuning methods for the log-linear model usually suffer from two shortcomings. First, their performance is highly dependent on the choice of the development set, but a suitable development set is usually unavailable and not simple to create, which may lead to unstable translation performance at test time because of the difference between the development set and the test set. Second, they optimize a single weight vector towards a given development set, but this weight vector cannot yield consistent results at the sentence level. To overcome these two shortcomings, this thesis proposes a local training method, which tunes many weight vectors, one for each test sentence, and is thus different from the existing methods. The bottleneck of local training is its training efficiency, so this thesis also proposes an efficient incremental training method. Note that, in terms of its decision function at test time, the local training method works like a nonlinear model.

(4) When modeling translation phenomena, the log-linear model has two limitations: its features are strictly required to be linear with respect to the model function, which may induce modeling inadequacy; in addition, it cannot deeply interpret and represent its surface features. A potential solution to these limitations is modeling with neural networks. On the one hand, neural networks can go beyond the linear limitation and can approximate arbitrary continuous functions; in other words, their modeling is more adequate. On the other hand, they can represent and abstract surface features by using hidden units. However, when modeling and decoding are considered together, classical neural networks are challenged by decoding efficiency due to their inherent characteristics. Therefore, this thesis proposes a variant neural network called the Additive Neural Network for machine translation and investigates an efficient method for its discriminative training.
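For reference, both abstracts build on the same log-linear translation model: the score of a candidate translation is a weighted sum of feature functions, and decoding selects the highest-scoring candidate. In standard notation (ours, not quoted from the thesis):

\[
P(e \mid f) = \frac{\exp\big( \sum_{i=1}^{M} \lambda_i\, h_i(f, e) \big)}{\sum_{e'} \exp\big( \sum_{i=1}^{M} \lambda_i\, h_i(f, e') \big)},
\qquad
\hat{e} = \operatorname*{arg\,max}_{e} \; \sum_{i=1}^{M} \lambda_i\, h_i(f, e),
\]

where \(f\) is a source sentence, \(e\) a candidate translation, \(h_i\) the feature functions, and \(\lambda_i\) the weights that discriminative training (tuning) estimates.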
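A plausible way to write the objective sketched in contribution (1), in our notation rather than the thesis's exact formulation: the expected error over the current k-best list is minimized while a proximity term keeps the new weights close to the previous ones, which is what damps the oscillation caused by the changing k-best list:

\[
w^{(t+1)} = \operatorname*{arg\,min}_{w} \; \sum_{e \in \mathcal{K}^{(t)}(f)} P_{w}(e \mid f)\, \mathrm{Err}(e) \; + \; \gamma \, \lVert w - w^{(t)} \rVert_2^2,
\]

where \(\mathcal{K}^{(t)}(f)\) is the k-best list at step \(t\), \(\mathrm{Err}(e)\) is a sentence-level error such as \(1 - \mathrm{BLEU}\), and \(\gamma\) sets how conservative the update is. Because such an objective is smooth, it can be minimized with the projected-gradient learner mentioned in the abstract, which is simpler to implement than MERT's line search.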
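The OSCAR penalty (Octagonal Shrinkage and Clustering Algorithm for Regression; Bondell and Reich, 2008) behind the automatic feature grouping of contribution (2) combines an L1 term, which drives individual weights to zero, with a pairwise L-infinity term, which pushes weights toward a shared magnitude and thereby clusters features into groups:

\[
\Omega(w) = \lambda_1 \sum_{i} \lvert w_i \rvert \; + \; \lambda_2 \sum_{i < j} \max\{ \lvert w_i \rvert,\, \lvert w_j \rvert \}.
\]

Grouped features share one effective weight, so statistics from frequent features can compensate for sparse ones.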
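The closing remark of contribution (3), that local training behaves like a nonlinear model, can be made concrete (again in our notation): since a separate weight vector is tuned for every test sentence, the weights become a function of the input, so the overall decision function is no longer linear in the features even though each per-sentence model is:

\[
\hat{e}(f) = \operatorname*{arg\,max}_{e} \; \sum_{i=1}^{M} \lambda_i(f)\, h_i(f, e),
\]

where \(\lambda(f)\) is the weight vector trained locally for test sentence \(f\).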
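Finally, a minimal sketch of the additive property that plausibly makes the Additive Neural Network of contribution (4) decoder-friendly. The assumption illustrated here, not a quotation of the thesis's model, is that the hypothesis score decomposes into a sum of small per-unit networks over local features; all names, sizes, and the tanh nonlinearity below are our illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    D, H = 8, 4                        # local-feature dim, hidden units (toy sizes)
    W = rng.normal(size=(H, D))        # hidden-layer weights
    b = np.zeros(H)                    # hidden-layer bias
    v = rng.normal(size=H)             # output weights

    def unit_score(phi):
        """Nonlinear score of ONE translation unit (e.g., a phrase pair),
        computed from that unit's local feature vector alone."""
        return float(v @ np.tanh(W @ phi + b))

    def hypothesis_score(units):
        """Additivity: the hypothesis score is the sum of per-unit scores,
        so extending a partial hypothesis adds one term instead of
        re-running a network over the whole hypothesis -- the property
        that keeps dynamic-programming decoding efficient."""
        return sum(unit_score(phi) for phi in units)

    units = [rng.normal(size=D) for _ in range(3)]   # three toy translation units
    partial = hypothesis_score(units[:2])
    full = partial + unit_score(units[2])            # incremental, as in decoding
    assert np.isclose(full, hypothesis_score(units))

A classical network over features of the whole hypothesis would have to be re-evaluated for every partial hypothesis the decoder extends, which is the decoding-efficiency problem the abstract refers to; additivity restores the decomposability that standard dynamic-programming decoding assumes.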
