基于知识自动获取的无指导译文消歧方法研究

Research on Unsupervised Word Translation Disambiguation Based on Automatic Knowledge Acquisition

【Author】 Liu Pengyuan (刘鹏远)

【Advisor】 Zhao Tiejun (赵铁军)

【Author information】 Harbin Institute of Technology, Computer Application Technology, 2008, PhD

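【Abstract (translated from the Chinese)】 This is the age of the Internet: using an efficient search engine to find information online has become the most important way people obtain information today. As information becomes increasingly international, understanding and processing across different languages has remained a gap that is hard to bridge, which creates an urgent demand for machine translation and for cross-language information retrieval and processing. Many difficult problems in this research still urgently need solving. One of the main ones is how to choose, for a polysemous source-language word, the target-language translation that is semantically correct; this is called word translation disambiguation (WTD). WTD and its monolingual analogue, word sense disambiguation, have long been basic research topics in natural language processing and are among its key and difficult problems.

Considering the current state of word translation and word sense disambiguation, and based on a comparative analysis of the various unsupervised disambiguation methods, this thesis holds that the key problems of unsupervised WTD today are the automatic acquisition and use of disambiguation knowledge, overcoming data sparseness, and building bilingual semantic dictionaries. Accordingly, the thesis does not dwell on machine learning algorithms or the selection of disambiguation features. Instead it concentrates on the core of unsupervised WTD, knowledge acquisition, and uses the acquired knowledge to perform the unsupervised disambiguation task while overcoming data sparseness. Starting from this idea, the thesis proposes a series of progressively more advanced new methods for acquiring knowledge for unsupervised WTD and for performing the disambiguation. All of these methods are evaluated and compared on international standard semantic evaluation corpora, and all outperform the best previous comparable unsupervised systems. Finally, the thesis also studies another key problem, the automatic construction of a bilingual semantic dictionary.

The specific research content covers the following aspects:

1. A tagged target-language corpus is acquired automatically and used directly to form a translation disambiguation model, and a method of disambiguating translations with this model is proposed. On this basis, the concept of the equivalent pseudo-translation and a method for constructing equivalent pseudo-translations are proposed and used to achieve unsupervised WTD. Finally, experiments and comparisons are carried out on the international semantic evaluation data set Senseval-2 ELS.

2. Based on observations of indirect association in bilingual corpora, a fully unsupervised WTD method that uses the degree of indirect association between bilingual words is proposed. Web resources are used extensively in computing the degree of indirect association: a method for computing Web-based lexical indirect association (Web_IA) is designed, and three different decision methods are used in the disambiguation process. Then, to address the shortcomings of the indirect-association approach, the thesis treats the entire Web as a semantic dictionary and uses the Web directly to define, analyze, and compute the semantic relatedness between bilingual words (WBR). After comparative experiments on an adapted standard semantic relatedness test set show the WBR method to be feasible, a fully unsupervised WTD method based on WBR is designed and compared in detail with the Web_IA-based method on the same international standard semantic evaluation data set, task 5 of Semeval2007.

3. Based on observations of the sentence sequences in which the words of an ambiguous word's synonym sets occur, an unsupervised WTD method based on an N-gram language model and Web mining is proposed. The method assumes that different senses of an ambiguous word correspond to different N-gram language model patterns, and it uses language model knowledge rather than semantic knowledge. A comparative evaluation on the same standard set shows that the method performs extremely well, exceeding TorMD, the best comparable unsupervised system on that task, by 12.8% (Pmar value). Finally, the language-model-based and semantic-model-based methods are compared in detail and the upper bound on performance is discussed.

4. A method is studied for automatically generating a WTD-oriented bilingual dictionary from WordNet, HowNet, and a large-scale bilingual parallel corpus. The method makes full use of the rich word alignment knowledge and statistics in the large-scale parallel corpus, together with similarity computation over the WordNet and HowNet semantic resources, to produce a dictionary that carries both bilingual semantic information and corpus statistics.

In summary, the thesis provides an essentially complete set of solutions for unsupervised WTD oriented toward automatic knowledge acquisition. In particular, its Web-based methods make a preliminary exploration of a Web-search-based line of research for translation/sense disambiguation, one of the hard problems in natural language processing.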

【Abstract】 The Internet dominates today's world, and for most people an efficient search engine is now the most important means of acquiring information from it. As information becomes increasingly international, the barrier between languages remains hard to cross, which creates an urgent need for research on machine translation (MT) and cross-language information retrieval (CLIR). Many hard problems in this research remain unresolved. One of them is how to select the correct target-language translation for an ambiguous source-language word, known as word translation disambiguation (WTD). WTD and its monolingual counterpart, word sense disambiguation (WSD), are fundamental, important, and difficult problems in natural language processing (NLP).

Given this situation, and after comparing and analyzing the various kinds of unsupervised disambiguation methods, this thesis argues that the key problems of unsupervised WTD/WSD lie in three areas: knowledge acquisition, data sparseness, and the construction of bilingual semantic resources. Focusing on knowledge acquisition and on overcoming data sparseness, which are the core problems in unsupervised WTD/WSD, the thesis introduces a series of novel methods for unsupervised WTD knowledge acquisition. All of the methods are evaluated on international gold-standard semantic evaluation data sets; all outperform the best comparable systems and achieve state-of-the-art results. Another key problem in WTD, the automatic construction of a bilingual semantic dictionary, is also studied.

In detail, the thesis is organized as follows:

1. The thesis studies the automatic acquisition of a sense-tagged corpus, the direct construction of a disambiguation classifier from that corpus, and the application of the classifier to WTD. On this basis, it introduces the concept of the Equivalent PseudoTranslation (EPT) and a WTD method based on it. Finally, the method is compared with other unsupervised systems on the Senseval-2 English Lexical Sample task.

2. A fully unsupervised WTD method based on the indirect association (IA) between bilingual words and on Web mining is introduced, motivated by observations on bilingual parallel corpora. Four methods of computing IA on the Web are designed, and three different kinds of decision strategies are used during disambiguation. Furthermore, the thesis treats the Web as a special semantic lexicon, so that the relatedness between bilingual words (WBR) can be defined and computed directly from the page counts returned by a Web search engine. After WBR is tested on a revised gold-standard data set and shown to be feasible, a fully unsupervised WTD method based on it is designed. Both the IA method and the WBR method are tested and compared on Semeval 2007 task 5, the Multilingual Chinese-English Lexical Sample task.

3. An unsupervised WTD method based on an N-gram language model (LM) and Web mining is introduced, motivated by observing the various sequences in which synonyms appear in different sentences. Based on the assumption "different sense, different N-gram pattern", the model disambiguates using LM knowledge rather than semantic knowledge. Tests on the same gold-standard set, for comparison with other systems, show that the proposed model performs very well and outperforms all other comparable unsupervised systems. A detailed comparison of the LM-based and semantic-based models, and the upper bound on performance from combining them, are also discussed.

4. A method is studied that automatically generates a WTD-oriented bilingual semantic dictionary from WordNet, HowNet, and a large-scale bilingual parallel corpus. The method mainly uses WordNet- and HowNet-based similarity between words to filter statistical noise during word alignment. Finally, it forms a bilingual semantic dictionary that carries information from all three resources: WordNet, HowNet, and the bilingual parallel corpus.

In brief, the thesis establishes an essentially complete set of knowledge-acquisition-oriented solutions for WTD. The Web-based methods in particular explore a Web-search approach to WTD/WSD, one of the hard problems in NLP.
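
As a rough illustration of the page-count idea in item 2, the sketch below scores each candidate translation by its relatedness to the context words and picks the highest-scoring one. The abstract does not give the thesis's actual WBR definition, so the Jaccard coefficient used here is a stand-in assumption, and the function names, the `toy_page_count` lookup, and every count in `TOY_COUNTS` are hypothetical; a real system would query a search engine instead.

```python
from typing import Callable, Dict, FrozenSet, Optional, Sequence

# A page-count function maps a set of query terms to a hit count.
PageCount = Callable[[FrozenSet[str]], int]


def web_jaccard(count_a: int, count_b: int, count_ab: int) -> float:
    """Jaccard coefficient over page counts: |A and B| / |A or B|.

    Assumption: this stands in for WBR, whose definition the abstract omits.
    """
    union = count_a + count_b - count_ab
    if count_ab <= 0 or union <= 0:
        return 0.0
    return count_ab / union


def choose_translation(
    context: Sequence[str],
    candidates: Sequence[str],
    page_count: PageCount,
) -> Optional[str]:
    """Return the candidate with the highest summed relatedness to the context."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = 0.0
        for word in context:
            score += web_jaccard(
                page_count(frozenset([cand])),
                page_count(frozenset([word])),
                page_count(frozenset([cand, word])),
            )
        if score > best_score:
            best, best_score = cand, score
    return best


# Hypothetical page counts for the ambiguous source word "花"
# (candidate translations "flower" / "spend") next to the context word "钱".
TOY_COUNTS: Dict[FrozenSet[str], int] = {
    frozenset(["flower"]): 1000,
    frozenset(["spend"]): 800,
    frozenset(["钱"]): 500,
    frozenset(["flower", "钱"]): 10,
    frozenset(["spend", "钱"]): 200,
}


def toy_page_count(terms: FrozenSet[str]) -> int:
    """Look up a made-up count; unknown queries count as zero hits."""
    return TOY_COUNTS.get(terms, 0)


if __name__ == "__main__":
    print(choose_translation(["钱"], ["flower", "spend"], toy_page_count))
```

With these made-up counts, "spend" co-occurs with "钱" far more often than "flower" does, so it is selected. Summing over context words is only one possible decision rule; the abstract mentions three decision strategies without describing them.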
