Study on Similarity-based Text Clustering Algorithm and Its Application

【Author】 Zeng Luping (曾路平)

【Advisor】 Li Xingyi (李星毅)

【Author information】 Jiangsu University, Computer Application Technology, 2009, Master's thesis

【Abstract】 Text clustering is an important branch of text mining and has been studied in considerable depth because of its distinctive knowledge-discovery capability. Text clustering algorithms are widely used for automatic document organization, the organization of search results, and digital library services. In practice, however, as text collections keep growing, traditional text clustering algorithms run into difficulties that are hard to overcome: they ignore the semantic correlation between the words in a text, and their clustering results are unstable. This thesis studies text clustering with these problems in mind. It first describes the traditional text clustering algorithms in detail and compares and analyzes them. Next, to address the vector space model's neglect of the semantic correlation between words, it proposes a text clustering algorithm based on word similarity (TCWS); to address the instability of the clustering results of the traditional K-Means algorithm, it proposes a K-Means algorithm based on the average similarity of texts (KAAST). Finally, the research results are applied to a public security information system. The main work of the thesis is summarized as follows:

(1) Commonly used text clustering algorithms are introduced, then analyzed and compared in terms of scalability, multidimensionality, the ability to handle high-dimensional data, and other aspects.

(2) A text clustering algorithm based on word similarity (TCWS) is proposed. The algorithm first clusters words by word similarity to capture the semantic correlation between them. It then represents each text using the resulting word clusters as the terms of the vector space model, which reduces the dimensionality of the vector space. Finally, it clusters the texts with a partition-based clustering algorithm. Experiments show that TCWS improves the accuracy of the clustering results.

(3) A K-Means algorithm based on the average similarity of texts (KAAST) is proposed. The algorithm first constructs the set of average text similarities. It then selects from the set the text with the highest current average similarity as an initial cluster center, while deleting from the set the texts related to that center's cluster, so that the selected centers are both representative and well dispersed. Finally, it uses the selected centers as the initial cluster centers of K-Means to cluster the texts. Experiments show that KAAST is considerably more stable.

(4) Building on this theoretical work, the proposed algorithms are applied to a public security information system, and a text clustering subsystem is designed and implemented, improving the efficiency and accuracy of information processing.
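The seed-selection step described in (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not name the similarity measure or say how texts "related to" a chosen center are identified, so cosine similarity over term vectors and a fixed `related_threshold` are assumptions here. The function names, and the choice to compute the averages once instead of recomputing them after each deletion, are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_initial_centers(docs, k, related_threshold=0.5):
    """Return the indices of up to k seed documents.

    Assumption: a document counts as "related" to a chosen seed when their
    cosine similarity is at least related_threshold. The abstract does not
    define this criterion.
    """
    n = len(docs)
    if n < 2 or k < 1:
        raise ValueError("need at least two documents and k >= 1")

    sim = [[cosine(docs[i], docs[j]) for j in range(n)] for i in range(n)]
    # Average similarity of each document to all the other documents.
    avg = {i: sum(sim[i][j] for j in range(n) if j != i) / (n - 1)
           for i in range(n)}

    centers = []
    while avg and len(centers) < k:
        # Candidate with the highest average similarity among those remaining.
        best = max(avg, key=avg.get)
        centers.append(best)
        # Remove the seed itself and every remaining document close to it,
        # so that the next seed comes from a different region of the data.
        related = [j for j in avg
                   if j == best or sim[best][j] >= related_threshold]
        for j in related:
            del avg[j]
    return centers


# Toy term vectors: documents 0-2 share one topic, documents 3-4 another.
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
]
print(select_initial_centers(docs, k=2))  # [1, 4]
```

The returned indices would then be passed to a K-Means implementation as its initial centers in place of random seeds. Because the selection is deterministic for a given document set, repeated runs start from the same centers, which is the kind of stability the abstract attributes to KAAST.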

  • 【Online publication contributor】 Jiangsu University
  • 【Online publication issue】 2009, Issue 09