Research on Statistical Machine Translation at Document Level (文档级统计机器翻译的研究)

【Author】 Gong Zhengxian (贡正仙)

【Advisor】 Zhou Guodong (周国栋)

【Author information】 Soochow University (苏州大学), Computer Application Technology, 2014, PhD

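【Abstract (translated from the Chinese)】 Machine translation is an active research topic in natural language understanding; it effectively promotes information sharing and has broad research and application value. Statistical machine translation (SMT) is currently the mainstream machine translation technology, but SMT systems that translate sentences in isolation can use only the information in the current sentence and completely ignore the links between neighboring sentences and the global information of the text. Yet document-level information, such as style, topic, and genre, is extremely useful for machine translation: it not only guides the system toward correct disambiguation of word forms and word senses, but also keeps the translation consistent with the source in language style and key content. Although the concept of discourse-based machine translation was proposed as early as 1992, most machine translation work to date remains at the level of isolated sentences. There are objective reasons for this, such as the limitations of available corpora. Overall, however, the slow progress shows precisely that this research is challenging. This dissertation studies document-level statistical machine translation. Its main contents are:

1. Research on system architectures for document-level SMT. Drawing on the working process of human translators, this dissertation first proposes a document-level SMT framework based on multi-strategy caching. The framework comprises three kinds of caches, which capture the document's background knowledge, topic, and lexical cohesion respectively. The information in these caches is ingeniously designed as system features in the SMT log-linear model, guiding a traditional SMT system to use document-level knowledge flexibly and effectively. The second architecture is a post-processing approach based on N-best lists, which focuses on keeping the topical content of the translation consistent with that of the source. Its main principle is to borrow methods from text summarization and, together with a topic model, select from the N-best lists a set of translation hypotheses that better matches the source content. Experiments show that both architectures successfully integrate document information; the first has the greater advantage, and its performance is significantly better than that of a traditional sentence-level SMT system.

2. Research on tense in document-level SMT. Tense research is an effective extension of knowledge in document-level SMT research; it is built on the cache-based framework and can incorporate more contextual knowledge into the translation system. Exploiting the continuity of tense within a document, this dissertation proposes an N-gram tense model that reflects regularities of tense change at two levels: within sentences and between sentences. On this basis, it further proposes a classification-based tense model with greater generalization ability. Experiments show that each tense model has its own strengths, and that SMT systems combined with a tense model significantly improve translation quality; the best system improves the BLEU score by 0.97 points.

3. Research on automatic evaluation methods for document-level machine translation. This dissertation explores automatic evaluation of document-level translation from two angles. First, starting from the requirement that a translation reflect the key content of the source, it proposes a topic-sentence-driven evaluation method and a topic-model-based evaluation method. Second, starting from the requirement that document-level translation preserve lexical cohesion, it proposes an evaluation method based on lexical chains. Experiments show that the improved evaluation methods raise, to varying degrees, the correlation coefficient with document-level human scores.

These three aspects form an organic whole and cover fairly comprehensively several core problems that document-level SMT urgently needs to address. Related research at home and abroad is still in its early stages, and this dissertation is likewise exploratory; the research content above is clearly innovative and is expected to provide important reference value for future related research.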
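
The idea in item 1, that cached document information enters the SMT log-linear model as an extra feature, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the dissertation: the function names (`cache_feature`, `loglinear_score`, `pick_best`), the definition of the cache feature as the share of hypothesis words already in the cache, the weights, and the toy hypotheses.

```python
def cache_feature(hypothesis, cache):
    """Hypothetical cache feature: share of hypothesis words found in the cache."""
    words = hypothesis.split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in cache) / len(words)


def loglinear_score(features, weights):
    """Weighted sum of feature values, the usual form of a log-linear model score."""
    return sum(weights[name] * value for name, value in features.items())


def pick_best(hypotheses, cache, weights):
    """Return the hypothesis text with the highest log-linear score."""
    def score(item):
        text, base_logprob = item
        features = {"base": base_logprob, "cache": cache_feature(text, cache)}
        return loglinear_score(features, weights)
    return max(hypotheses, key=score)[0]


# Toy data, invented for illustration only.
hypotheses = [
    ("the bank approved the loan", -2.0),
    ("the shore approved the loan", -1.9),
]
cache = {"bank", "loan"}  # words assumed to come from earlier sentences of the document

# With the cache weight at 0 the model behaves like a sentence-level system.
sentence_level = pick_best(hypotheses, cache, {"base": 1.0, "cache": 0.0})
# With a positive cache weight, document context can change the choice.
document_level = pick_best(hypotheses, cache, {"base": 1.0, "cache": 1.0})
print(sentence_level)
print(document_level)
```

In this toy setting the sentence-level configuration prefers the hypothesis with the slightly higher base score, while the configuration with a cache weight prefers the hypothesis whose wording agrees with the earlier document context. In a real system the feature weights would be tuned rather than set by hand.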

【Abstract】 Machine translation is an active research topic in natural language understanding. It can effectively promote information sharing and thus has wide application and research value. Statistical machine translation (SMT) has been the mainstream machine translation technology in recent years. However, most SMT systems translate documents sentence by sentence under strict independence assumptions. They therefore use only limited sentence-level context and completely ignore the relationships between sentences and the global information of the text. Nevertheless, the characteristics of a text, such as style, subject, and genre, can help to disambiguate word senses, keep the language style consistent, and especially convey the key information of the original text during translation. The idea of performing machine translation at the discourse level was put forward as early as 1992; however, most machine translation systems still work at the level of isolated sentences. The reasons are manifold, such as the lack of document information in parallel corpora. But the slow progress itself shows that this is a tremendously challenging task. The main contents of this dissertation are:

1. Research on designing reliable frameworks for document-level SMT. In order to closely simulate the human translation process, we first present a cache-based document-level SMT system. Its caches fall into three categories, which describe the background, topic, and lexical cohesion of the text respectively. Three kinds of features for the SMT log-linear model are then designed to use the information in these caches. The proposed framework can guide traditional SMT systems to use document-level knowledge effectively. The second framework is based on the N-best list produced by an SMT system, so we call it a post-processing procedure. The point of this approach is to control the consistency of topic models between the source-side and target-side texts. Inspired by the idea of extractive summarization, this system generates the final hypothesis collection by dynamically selecting translation hypotheses from the N-best list under a topic-model consistency assumption. Both frameworks can successfully integrate document-level knowledge into SMT systems, and according to the experimental results the former achieves more significant improvements.

2. Research on tense models for document-level SMT. Tense research is an effective knowledge extension for document-level SMT. The tense model works on top of our cache-based SMT system and can integrate rich contextual knowledge. Based on temporal continuity within a document, this dissertation puts forward an N-gram-based tense model, which can reflect both inter-sentence and intra-sentence tense variation. Furthermore, it proposes a classifier-based tense model with greater generalization ability. Experiments show that combining SMT with a tense model can effectively improve translation quality, and the best SMT system improves by 0.97 points in BLEU score.

3. Research on automatic evaluation metrics for document-level SMT. Translation results should reflect the main content of the original texts, so we first propose a topic-sentence-driven evaluation metric and a topic-model-based evaluation metric. Second, document-level translation should preserve lexical cohesion, and thus an evaluation metric based on lexical chains is proposed. Experimental results show that the proposed evaluation metrics can improve the Spearman correlation with human assessments.

This dissertation comprehensively covers the core issues of document-level SMT. Related research at home and abroad is currently still in its infancy. This work is innovative within SMT and offers considerable reference value for future research on document-level SMT.
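
The abstract reports that the proposed metrics are judged by their Spearman correlation with human assessments. As a reminder of what that statistic measures, here is a minimal pure-Python sketch: it ranks both score lists (ties get the average rank) and computes the Pearson correlation of the ranks. The function names and the sample scores are invented for illustration and are not data from the dissertation.

```python
import math


def average_ranks(values):
    """Return 1-based ranks for values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-document scores, invented for illustration only.
metric_scores = [0.31, 0.45, 0.28, 0.52]
human_scores = [2.5, 3.5, 2.0, 4.0]
rho = spearman(metric_scores, human_scores)
print(rho)
```

A value near 1 means the metric orders documents the same way human judges do, and a value near -1 means it orders them in reverse. Because only ranks matter, the metric and the human scores do not need to share a scale.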

  • 【Online publication contributor】 Soochow University (苏州大学)
  • 【Online publication issue】 2014, No. 09