

Identification and Extraction of Phrasal Paraphrase

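【Author】 Liu Shuwei (刘树伟)

【Supervisor】 Liu Ting (刘挺)

【Author information】 Harbin Institute of Technology (哈尔滨工业大学), Computer Science and Technology, 2009, Master's thesis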

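【Abstract (Chinese version)】 Paraphrases are different expressions of the same meaning, and paraphrase research matters in many application areas of natural language processing. The main task of this thesis is the acquisition of phrase-level paraphrase resources. Its purpose is to supply more resources to paraphrase generation models based on statistical machine translation and thereby improve the quality of paraphrase generation. The method for extracting phrase-level paraphrases has two steps: acquiring paraphrase phrase candidates and confirming those candidates. Candidate acquisition uses a method based on comparable news; its main advantage is that comparable news is abundant on the Internet, so a paraphrase phrase collection of considerable size can be built. Candidate extraction consists of obtaining a news corpus, identifying comparable news from the similarity of news content and the interval between publication times, extracting comparable sentences from the comparable news, and extracting paraphrase phrases from the comparable sentences. Candidate confirmation uses binary classification, with the emphasis on the design of classification features. The features are mainly statistical features derived from a paraphrase corpus: a word alignment feature based on the χ² method, a word alignment feature based on mutual information, and a part-of-speech tag template alignment feature based on the χ² method. The first two are lexical-level statistical features; the last is a statistical feature that uses part-of-speech information as a template. In addition, we use some simple phrase string similarity features, such as word length ratio, word overlap rate, and edit distance. The experimental results show that the comparable-news method can obtain paraphrase phrases on a large scale, and the feature comparison shows that every class of feature contributes to classification accuracy, with the χ²-based word alignment feature contributing the most. The comparable-news method yields 2,961,739 paraphrase phrase pairs with a precision of 21.47%. Classifying these 2,961,739 candidate pairs with the four classes of features, we finally extract 595,619 paraphrase phrase pairs with a precision of 59.3%, an improvement of 37.83 percentage points.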
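The abstract names χ² and mutual information as the statistics behind the word alignment features but gives no formulas. The sketch below is only an illustration under stated assumptions: it uses the standard 2×2 contingency-table χ² statistic and pointwise mutual information, and it assumes counts are collected over (left phrase, right phrase) token lists from a paraphrase corpus. The function names and the toy data are hypothetical and are not taken from the thesis.

```python
import math


def contingency(pairs, x, y):
    """Count co-occurrence of word x (left side) and word y (right side).

    pairs: iterable of (left_tokens, right_tokens); this layout is an assumption.
    Returns (a, b, c, d): both present, only x, only y, neither.
    """
    a = b = c = d = 0
    for left, right in pairs:
        has_x = x in left
        has_y = y in right
        if has_x and has_y:
            a += 1
        elif has_x:
            b += 1
        elif has_y:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no association signal
    return n * (a * d - b * c) ** 2 / denom


def pmi(a, b, c, d):
    """Pointwise mutual information log2(P(x, y) / (P(x) * P(y)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the words never co-occur
    return math.log2(a * n / ((a + b) * (a + c)))


# Toy corpus (made up for illustration only).
toy_pairs = [
    (["buy", "car"], ["purchase", "car"]),
    (["buy", "house"], ["purchase", "home"]),
    (["sell", "car"], ["sell", "vehicle"]),
    (["rent", "flat"], ["lease", "flat"]),
]
counts = contingency(toy_pairs, "buy", "purchase")
print(counts, chi_square(*counts), pmi(*counts))
```

A phrase-level feature could then aggregate such word-pair scores over the words of a candidate pair; how the thesis aggregates them is not stated in the abstract.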

【Abstract】 Paraphrases are alternative ways of conveying the same information, and they are important in many natural language processing (NLP) applications. This thesis focuses on extracting phrasal paraphrase resources. The aim is to provide more resources for paraphrase generation models based on statistical machine translation (SMT) and thereby improve paraphrase generation performance.

The extraction method has two steps: candidate extraction and paraphrase identification. Candidate extraction is based on comparable news. Its main advantage is that comparable news is abundant on the Internet, so the method can extract paraphrase phrases on a large scale. It consists of four steps: crawling a news corpus, identifying comparable news from content similarity and publication time interval, extracting comparable sentences from the comparable news, and extracting paraphrase phrases from the comparable sentences. Paraphrase identification is based on binary classification, with the emphasis on the design of classification features. The features are mainly statistical features computed from a paraphrase corpus: a word alignment feature based on χ², a word alignment feature based on mutual information, and a part-of-speech template alignment feature based on χ². The first two are lexical, and the last is used to extract syntactic paraphrases. In addition, we use a few phrase string similarity features, such as word length ratio, word overlap, and word edit distance.

The experimental results show that candidate extraction based on comparable news is able to obtain paraphrase phrases on a large scale. Feature evaluation shows that each type of feature helps to increase classification performance, especially the word alignment feature based on χ². Using the comparable-news method, we extract 2,961,739 pairs of paraphrase phrase candidates with a precision of 21.47%. After classification-based identification, we obtain 595,619 pairs of paraphrase phrases with a precision of 59.3%, an improvement of 37.83 percentage points.
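The phrase string similarity features mentioned in the abstract (word length ratio, word overlap, word edit distance) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the abstract does not define the features exactly, so the sketch assumes a shorter-to-longer length ratio, Jaccard overlap over word sets, and word-level Levenshtein distance. All function names are hypothetical.

```python
def length_ratio(p, q):
    """Ratio of the shorter phrase length to the longer one, in words (assumed definition)."""
    shorter, longer = sorted((len(p), len(q)))
    return shorter / longer if longer else 1.0


def overlap_rate(p, q):
    """Jaccard overlap of the two word sets (assumed definition of word overlap)."""
    sp, sq = set(p), set(q)
    union = sp | sq
    return len(sp & sq) / len(union) if union else 1.0


def edit_distance(p, q):
    """Word-level Levenshtein distance computed with a rolling row."""
    prev = list(range(len(q) + 1))
    for i, wp in enumerate(p, 1):
        cur = [i]
        for j, wq in enumerate(q, 1):
            cost = 0 if wp == wq else 1
            cur.append(min(prev[j] + 1,          # delete wp
                           cur[j - 1] + 1,       # insert wq
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def string_features(p, q):
    """Feature vector for one candidate phrase pair (tokenized word lists)."""
    return [length_ratio(p, q), overlap_rate(p, q), edit_distance(p, q)]


# Made-up candidate pair for illustration.
print(string_features("take part in".split(), "participate in".split()))
```

Such a vector would be concatenated with the statistical alignment features before being passed to whichever binary classifier is used; the abstract does not name the classifier.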

  • 【Classification number】TP391.1

