Web双语平行语料自动获取及其在统计机器翻译中的应用

Mining Bilingual Parallel Corpora from Web Automatically and Its Application in Statistical Machine Translation

【Author】 Lin Zheng

【Advisor】 Ma Xirong

【Author information】 Tianjin Normal University, Computer Application Technology, 2010, Master's thesis

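【Abstract (from the Chinese)】 Bilingual parallel corpora have many important applications in natural language processing. They provide indispensable training data for statistical machine translation models and are also a basic resource for applications such as lexicography and cross-language information retrieval. However, large-scale bilingual parallel corpora are not easy to obtain, and existing parallel corpora cannot yet meet the practical needs of processing real text in terms of scale, timeliness, and domain balance. With the spread and rapid growth of the Internet, more and more bilingual websites are being created and more and more information is published in multiple languages, which offers a rich source for building bilingual and multilingual corpora. Some researchers have proposed Web-based methods for automatically mining bilingual or multilingual parallel corpora, offering an effective route to building such corpora automatically. This thesis is devoted to building a Web-based system for automatically acquiring large-scale bilingual parallel corpora. The main results are as follows.

1. Automatic discovery and acquisition of mixed-language pages. Bilingual parallel resources on the Internet fall into two classes. In the first, the bilingual resources are distributed across two web pages that are written in different languages and are translations of each other; we call these bilingual parallel pages. In the second, the bilingual resources lie within a single page; we call these mixed-language pages. Earlier systems were mainly based on bilingual parallel pages, but through observation we found that the Web contains a large number of mixed-language pages, and that the bilingual text on them is more neatly aligned and of higher translation quality, which makes them a very valuable source of bilingual resources. Bilingual parallel pages are similar in address or structure, and the methods for handling them are already mature, but these methods do not apply to mixed-language pages. The distribution of candidate mixed-language pages is usually uncertain and lacks the usual heuristic cues, so they are harder to acquire. This thesis proposes a method for automatically discovering mixed-language pages based on a trial-download strategy; the candidate mixed-language sites obtained with this method have a relatively high accuracy.

2. Extraction of parallel sentence pairs from mixed-language pages. The task of extracting parallel sentence pairs from mixed-language pages can be divided into three parts: web-page noise filtering, mixed-language page verification, and sentence alignment. This thesis studies and implements two noise-removal methods: a dedicated template-based method and a general method based on the HTML tag tree. Verification of mixed-language pages is carried out in two steps: a coarse check based on bilingual character counts and a fine check based on a dictionary. Finally, a sentence alignment method based on hybrid information converts document-level bilingual parallel text into bilingual parallel sentence pairs. Having addressed these three difficulties, the thesis implements an automatic parallel-corpus mining system based on mixed-language pages.

3. Application of Web bilingual parallel corpora in practice. This thesis applies the bilingual parallel sentence pairs obtained from the Web to model training for statistical machine translation and proposes two different strategies, sentence-pair quality ranking and domain information retrieval, for loading Web parallel corpora into the training set. Experiments show that both strategies improve the performance of the translation system; on the IWSLT evaluation tasks the BLEU score improves by 2 to 5 percentage points.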
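The two-step page check described above (a coarse check on bilingual character counts, then a finer dictionary-based check) could look roughly like the following. This is only a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the thresholds `0.1` and `0.3`, the restriction to Chinese and English, and the toy lexicon format are all hypothetical.

```python
import re

# Assumption: "Chinese" means the basic CJK Unified Ideographs block,
# and "English" means ASCII letters. The thesis does not specify this.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_CHAR = re.compile(r"[A-Za-z]")
EN_WORD = re.compile(r"[A-Za-z]+")


def char_ratio(text):
    """Ratio of the smaller to the larger of the two character counts."""
    zh = len(CJK_CHAR.findall(text))
    en = len(LATIN_CHAR.findall(text))
    if zh == 0 or en == 0:
        return 0.0
    return min(zh, en) / max(zh, en)


def translation_ratio(text, lexicon):
    """Fraction of English words with a dictionary translation on the page.

    `lexicon` is a hypothetical dict: lowercase English word -> list of
    Chinese translations.
    """
    words = [w.lower() for w in EN_WORD.findall(text)]
    if not words:
        return 0.0
    hits = sum(
        1 for w in words
        if any(zh in text for zh in lexicon.get(w, ()))
    )
    return hits / len(words)


def is_mixed_page(text, lexicon, min_char_ratio=0.1, min_translation_ratio=0.3):
    """Coarse check first (cheap), dictionary check second (more costly)."""
    if char_ratio(text) < min_char_ratio:
        return False
    return translation_ratio(text, lexicon) >= min_translation_ratio


if __name__ == "__main__":
    toy_lexicon = {"weather": ["天气"], "good": ["好"], "today": ["今天"]}
    page = "今天天气很好。 The weather is good today."
    print(is_mixed_page(page, toy_lexicon))
```

Running the cheap character-count test first means that monolingual pages are rejected before any dictionary lookups are made, which is one plausible reason for ordering the two steps this way.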

【Abstract】 Bilingual parallel corpora have many important applications in natural language processing: they provide essential training data for statistical machine translation and are used in lexicography and cross-language information retrieval. However, large-scale bilingual parallel corpora are not easy to obtain, and existing parallel corpora cannot meet practical needs in terms of scale, timeliness, and domain balance. With the spread and rapid development of the Internet, more and more bilingual sites have been created and more and more information is published in multiple languages, which can serve as a source of bilingual and multilingual corpora. Some researchers have proposed effective Web-based methods for automatically mining bilingual or multilingual parallel corpora. This thesis aims to build a Web-based system for the automatic acquisition of large-scale bilingual parallel corpora. The main contributions are as follows.

1. Automatic discovery and acquisition of mixed-language Web pages. Bilingual parallel resources on the Internet can be divided into two categories. In the first, the bilingual resource is distributed across two pages that are written in different languages and have the same meaning; these are called bilingual parallel pages. In the second, the bilingual resource is located within a single page; these are called mixed-language pages. Previous systems are mainly based on the first category, but through observation we found that there are a large number of mixed-language pages on the Web, and that their parallel texts are neater and their translation quality is higher, which makes them a very valuable source of bilingual corpora. Bilingual parallel pages show address or structural similarity, and the methods for handling them are already mature, but these methods cannot be applied to mixed-language pages. The distribution of candidate mixed-language pages is usually uncertain, and the lack of common heuristic information makes their discovery more difficult. This thesis presents a method for automatically discovering mixed-language pages based on a tentative-download strategy; with this method, eligible candidate mixed-language pages are obtained with accuracy close to 100%.

2. Extraction of bilingual parallel sentence pairs from mixed-language pages. The task of extracting bilingual parallel sentence pairs from mixed-language pages can be divided into three parts: Web noise filtering, mixed-language page identification, and sentence alignment. We implemented two methods for filtering Web noise: a dedicated template-based approach and a general approach based on the HTML tag tree. Mixed-language pages are identified in two steps: the first is based on the ratio of character counts and the second on the translation ratio. Finally, we convert the parallel passages into parallel sentences using a hybrid-information-based alignment method. Having solved these three difficult problems, this thesis implements an automatic mining system based on mixed-language pages.

3. Application of the Web bilingual parallel corpus. We apply the bilingual parallel sentences obtained from the Web to the training of statistical machine translation models, and we propose a sentence quality sorting method and an information retrieval method for loading the Web corpus into the training set. The results show that both strategies improve the performance of the translation system. Experiments conducted on the IWSLT tasks show gains of +2 to +5 BLEU over the baseline.
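The sentence quality sorting strategy mentioned in item 3 can be illustrated as: score each Web-mined sentence pair, sort by score, and add only the top-ranked pairs to the training set. The abstract does not say how quality is scored, so the length-based score below is a stand-in chosen purely for illustration; `length_score`, `select_top`, `extend_training_set`, and the expected ratio of `1.6` Chinese characters per English word are all hypothetical.

```python
def length_score(zh, en, expected=1.6):
    """Hypothetical quality score in [0, 1].

    Measures how close the ratio of Chinese characters to English words
    is to `expected` (an arbitrary placeholder, not a value from the thesis).
    """
    en_words = len(en.split())
    zh_chars = sum(1 for ch in zh if "\u4e00" <= ch <= "\u9fff")
    if en_words == 0 or zh_chars == 0:
        return 0.0
    ratio = zh_chars / en_words
    return min(ratio, expected) / max(ratio, expected)


def select_top(pairs, k, score=length_score):
    """Return the k highest-scoring (zh, en) pairs, best first."""
    ranked = sorted(pairs, key=lambda pair: score(*pair), reverse=True)
    return ranked[:k]


def extend_training_set(train_pairs, web_pairs, k):
    """Append the k best Web-mined pairs to an existing training set."""
    return list(train_pairs) + select_top(web_pairs, k)


if __name__ == "__main__":
    web_pairs = [
        ("你好", "This sentence is far too long to be a translation of a greeting"),
        ("今天天气很好", "The weather is good today"),
    ]
    for zh, en in select_top(web_pairs, 1):
        print(zh, "|", en)
```

Passing the scoring function as a parameter keeps the ranking step independent of the score itself, so a dictionary-based or alignment-based score could be substituted without changing the selection code.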
