Study on Key Phrase Extraction Technology for Scientific and Technical Essays

【Author】 Yan Chunfeng (严春风)

【Advisor】 Yao Jianmin (姚建民)

【Author information】 Soochow University (苏州大学), Computer Technology, 2009, Master's thesis

【Abstract】 Using Wanfang Data and conference proceedings as test corpora, this thesis presents a PAT-Tree-based method for key phrase extraction and the application of HowNet (知网) to key phrase extraction. It first verifies experimentally several characteristic features of key phrases and reviews the common methods for filtering them. It then introduces mutual information and the PAT-Tree data structure, which supports fast and convenient frequency counting of strings over the full text. On this basis it proposes a PAT-Tree-based extraction method that relies on statistics gathered from the original text and selects the strings that satisfy the filtering conditions. The process has four stages: preprocessing the text; building a PAT-Tree over the preprocessed text to obtain term frequency information; extracting candidate key phrases from the PAT-Tree; and filtering the candidates and selecting the final key phrases. The emphasis is on automatically filtering the strings that meet the statistical conditions, so as to further refine the candidate key phrases. The refinement step adopts new filtering techniques and borrows the strengths of other methods to form a combined filtering scheme, which effectively improves precision and reduces the amount of computation. A further feature of the thesis is that, because the conference proceedings form a domain corpus, a divide-and-conquer approach is used to handle the computation-intensive work and build the PAT-Tree efficiently. This makes it easier to extract domain key phrases and also allows the extraction to be implemented with distributed computing, leaving room to scale up processing capacity. The experimental results show that the method extracts key phrases efficiently, with particularly good results for domain key phrases, and achieves the expected goal. Finally, HowNet is introduced to compute the similarity of synonyms, in order to address two problems: synonyms co-occurring in the key phrase set, and words failing to enter the key phrase set because of synonymy.
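The four stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. A sorted list of truncated suffixes stands in for the PAT-Tree, since both answer "how often does this string occur in the full text" from a sorted view of the suffixes. The mutual-information score used here (the minimum pointwise mutual information over all two-way splits of a candidate), the probability estimate (count divided by text length), the thresholds, and every name (`preprocess`, `SuffixIndex`, `min_split_pmi`, `extract_candidates`) are assumptions chosen for illustration.

```python
import math
import re
from bisect import bisect_left, bisect_right

SEP = "\n"  # separator inserted by preprocess(); never allowed inside a candidate


def preprocess(raw: str) -> str:
    """Stage 1: drop punctuation and whitespace, keeping runs of word characters."""
    return SEP.join(re.findall(r"\w+", raw))


class SuffixIndex:
    """Stage 2: sorted, truncated suffixes (a simple stand-in for a PAT-Tree)."""

    def __init__(self, text: str, max_len: int):
        self.size = len(text)
        self.max_len = max_len
        self.keys = sorted(text[i:i + max_len] for i in range(len(text)))

    def count(self, s: str) -> int:
        """Occurrences of s in the text, overlapping ones included."""
        if not 0 < len(s) <= self.max_len:
            raise ValueError("query length must be between 1 and max_len")
        # All keys that start with s form one contiguous block in sorted order.
        # Assumes the text does not contain the code point U+10FFFF.
        lo = bisect_left(self.keys, s)
        hi = bisect_right(self.keys, s + "\U0010ffff")
        return hi - lo


def min_split_pmi(index: SuffixIndex, s: str) -> float:
    """Lowest pointwise mutual information (base 2) over all splits of s.

    Assumes s occurs in the indexed text and has at least two characters.
    """
    p_s = index.count(s) / index.size
    worst = math.inf
    for k in range(1, len(s)):
        p_left = index.count(s[:k]) / index.size
        p_right = index.count(s[k:]) / index.size
        worst = min(worst, math.log2(p_s / (p_left * p_right)))
    return worst


def extract_candidates(raw, min_freq=2, min_pmi=1.0, max_len=4):
    """Stages 3 and 4: enumerate strings, then filter by frequency and PMI."""
    text = preprocess(raw)
    index = SuffixIndex(text, max_len)
    seen, kept = set(), []
    for i in range(len(text)):
        for n in range(2, max_len + 1):
            s = text[i:i + n]
            if len(s) < n or SEP in s or s in seen:
                continue
            seen.add(s)
            freq = index.count(s)
            if freq >= min_freq and min_split_pmi(index, s) >= min_pmi:
                kept.append((s, freq))
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

Calling `extract_candidates(document_text, min_freq=3, min_pmi=2.0, max_len=6)` returns `(string, frequency)` pairs for the strings that pass both filters. Because the index is built from plain counts, the counts from separately processed parts of a corpus could be summed, which is one way to picture the divide-and-conquer construction mentioned in the abstract. The synonym handling with HowNet is not sketched here.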

【Key words】 key phrase extraction; PAT-Tree; mutual information; synonym
  • 【Online publication contributor】 Soochow University (苏州大学)
  • 【Online publication year and issue】 2011, Issue S2