基于词语网络的关键字提取策略研究
Research about Term Network Based Keywords Extraction Strategy
【Author】 阚洳沂;
【Supervisor】 唐雁;
【Author Information】 Southwest University, Computer Software and Theory, 2008, Master's
【Abstract (translated from the Chinese)】 Keywords are the terms that express the central content of a document, that computer systems use to index the content features of a paper, and that information systems collect so that readers can retrieve it. Keyword extraction is a branch of text mining and a foundational task for document retrieval, document comparison, summarization, and document classification and clustering. Keyword extraction algorithms fall into two categories: strategies that require a training set and strategies that do not. Training-set-based methods treat keyword extraction as a classification problem: the terms appearing in a document are assigned to a keyword class or a non-keyword class, and several terms are then selected from the keyword class as keywords. This line of algorithms was first proposed by Peter D. Turney and has matured considerably. Algorithms that do not require a training set can be divided into four classes: statistical methods, such as frequency counting; term-graph methods, such as KeyGraph; term-network methods, such as the betweenness centrality (BC) index; and methods based on small-world networks (SWN). All four are built on term frequency statistics. Statistical methods are simple and fast and can extract high-frequency terms, but they ignore terms that are important to the document yet occur infrequently, so the extracted keywords are one-sided. Term-graph methods require too many parameters, such as the number of vertices and edges, which often creates boundary trade-offs that hurt the stability and precision of the algorithm. SWN-based methods use average path length as the basis for keyword extraction, but SWN theory assumes a connected graph, so on disconnected document graphs they can neither measure vertex importance nor extract keywords correctly. This thesis studies term-network-based keyword extraction. After analyzing existing term-network-based algorithms and their shortcomings, it proposes a new term-network-based keyword extraction strategy for English documents that uses a node deletion index to measure the importance of vertices (terms). The extracted keywords include not only high-frequency words and phrases but also words and phrases that contribute strongly to the central content of the document while occurring infrequently. The experimental data come from the test data sets of the KEA and Extractor algorithms, together with papers from academic journals and electronic books provided by Springer, one of the world's leading scientific publishing groups. Using the keywords supplied by the papers' authors as the gold standard, and average precision and average recall as evaluation measures, the results of the proposed algorithm are compared with those of the TF and BC algorithms, demonstrating its correctness and effectiveness.
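The thesis text itself is not reproduced in this record, so the following is only a minimal sketch of the general idea the abstract describes: build a co-occurrence term network over an English document and rank terms by how much the network degrades when their vertex is deleted. The sliding window width, the stop-word list, the use of networkx, and the specific connectivity measure (reachable vertex pairs) are all illustrative assumptions, not the thesis's actual node deletion index.

```python
# Minimal sketch of term-network keyword extraction with a node-deletion score.
# Assumptions (not from the thesis): a sliding co-occurrence window of width 3,
# a tiny stop-word list, and connectivity measured as reachable vertex pairs.
import re

import networkx as nx

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "that", "it"}

def build_term_network(text, window=3):
    """Link terms that co-occur within a sliding window of the token stream."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    graph = nx.Graph()
    graph.add_nodes_from(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                graph.add_edge(tokens[i], tokens[j])
    return graph

def reachable_pairs(graph):
    """Number of unordered vertex pairs joined by some path."""
    return sum(len(c) * (len(c) - 1) // 2 for c in nx.connected_components(graph))

def node_deletion_scores(graph):
    """Score each term by the connectivity lost when its vertex is removed."""
    base = reachable_pairs(graph)
    scores = {}
    for node in graph.nodes:
        pruned = graph.copy()
        pruned.remove_node(node)
        scores[node] = base - reachable_pairs(pruned)
    return scores

if __name__ == "__main__":
    text = ("A term network links co-occurring terms; deleting a term that "
            "holds the network together suggests it is a keyword, even if "
            "the term itself is not frequent in the document.")
    ranking = sorted(node_deletion_scores(build_term_network(text)).items(),
                     key=lambda kv: -kv[1])
    print(ranking[:5])
```

A score of this kind rewards terms whose removal disconnects the network, which is one plausible way to surface low-frequency but structurally central terms; the frequency-based TF baseline mentioned in the abstract would miss them by construction.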
【Abstract】 Since the advent of the Internet in the 1990s, there has been tremendous growth in the volume of online text documents, such as electronic mail, web pages, and digital books. To make effective use of these documents, there is an increasing need for tools that can process text, and products for analyzing text documents have been developed to meet it. The techniques involved in document analysis form an exciting research area often called Text Mining. Keyword extraction plays a very important role in this domain, because keywords are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyword extraction is to select keywords from the text of a given document, which makes it feasible to generate keywords for the huge number of documents that have no manually assigned keywords. Previous approaches to keyword extraction fall into two groups. (1) Supervised classification: Turney first framed automatic keyword extraction as a supervised learning task, treating a document as a set of phrases that the learning algorithm must classify as positive or negative examples of keywords; its performance has been satisfactory for a wide variety of applications. (2) Unsupervised methods: algorithms that apply to a single document without a corpus, such as term frequency, SWN-based methods, term-graph methods, and term-network methods. Based on an analysis of existing term-network keyword extraction, an effective algorithm is proposed that extracts not only frequent terms but also important terms of low frequency; it is based on the term network and a deleting-actor (node deletion) index. The experimental results support this conclusion.
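As a companion sketch, the evaluation protocol mentioned above (average precision and average recall of extracted keywords against the author-assigned keywords, compared across extractors such as TF, BC, and the proposed index) could be computed roughly as follows. Exact lowercase string matching and the sample keyword lists are assumptions; the thesis's matching rules (stemming, phrase normalization) are not given in this record.

```python
# Minimal sketch of the evaluation protocol described in the abstract:
# average precision and average recall against the author-assigned keywords.
# Exact lowercase string matching and the sample lists below are assumptions.

def precision_recall(extracted, gold):
    """Precision and recall of one document's extracted keyword list."""
    extracted_set = {t.lower() for t in extracted}
    gold_set = {t.lower() for t in gold}
    hits = len(extracted_set & gold_set)
    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    return precision, recall

def average_scores(per_document_results):
    """Macro-average precision and recall over a collection of documents."""
    precisions, recalls = zip(*per_document_results)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

if __name__ == "__main__":
    # Hypothetical top-5 outputs of two extractors on a single document.
    gold = ["keyword extraction", "term network", "co-occurrence"]
    tf_top5 = ["term", "network", "keyword extraction", "document", "frequency"]
    nd_top5 = ["keyword extraction", "term network", "co-occurrence", "index", "vertex"]
    print("TF:", precision_recall(tf_top5, gold))
    print("Node deletion:", precision_recall(nd_top5, gold))
    # A real run would average over every document in the test collection:
    print("TF averages:", average_scores([precision_recall(tf_top5, gold)]))
```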
【Key words】 deleting actor; co-occurrence; keyword extraction; term network; betweenness centrality;
- 【Online Publication Contributor】 Southwest University 【Online Publication Year/Issue】 2008, Issue 09
- 【CLC Number】 TP391.1
- 【Citation Count】 2
- 【Download Count】 429