Research and Implementation of Text Clustering Based on DK-Means (基于DK-Means算法的文本聚类的研究与实现)

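【Author】 Yu Lili (于丽丽)

【Advisor】 Liu Huilin (刘辉林)

【Author information】 Northeastern University (东北大学), Computer Application Technology, 2008, Master's thesis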

【Abstract】 As information technology spreads across every field, the volume of data generated each day by applications is growing exponentially. Processing this data effectively and extracting useful knowledge from it is a pressing problem. Data mining is an emerging technology developed to help people fully understand and effectively apply the information and knowledge contained in data. Cluster analysis partitions a data set into several classes, or clusters, according to the similarity between data objects, and is a good way to discover the internal structure of the data and the knowledge in it; it can also serve as a preprocessing step before statistical analysis. It is an unsupervised process that groups samples according to some distance measure. Clustering can divide a large collection of texts into clusters that users can understand quickly, so that they can grasp the content of many documents faster, speed up analysis, and support decision-making. Cluster analysis has been applied in many disciplines, such as pattern recognition, image processing, and information retrieval. The type of data to be clustered varies with the requirements, for example ordinal, scalar, text, and mixed data. This thesis focuses on clustering text data.

This thesis studies the text dimensionality reduction methods and clustering algorithms involved in text clustering. First, for text preprocessing, it proposes a word segmentation method that incorporates word frequency, which improves segmentation accuracy and prepares for the later construction of the text model and for dimensionality reduction. Second, it proposes a dimensionality reduction method based on text similarity: by computing the similarity between a text and other texts, and the contribution of each feature word to the text's class attribute, it extracts the words most relevant to the text, which reduces dimensionality and improves the efficiency and precision of text clustering. Finally, it proposes a text clustering algorithm based on DK-Means, which improves clustering accuracy and speed compared with the original method.

The thesis first introduces cluster analysis within the field of data mining, then describes the techniques related to text clustering, including text preprocessing, text representation models, dimensionality reduction, and text clustering algorithms (K-Means, BIRCH, CURE, OPTICS, and others). It then studies new dimensionality reduction and clustering methods, proposing for feature dimensionality reduction the new method based on text similarity. Finally, text clustering is designed and implemented according to the proposed algorithms. Tests show that the proposed methods improve the accuracy and purity of the clustering results and also increase the speed of text clustering.
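The abstract does not define DK-Means, so the following is only a minimal sketch of the baseline it builds on: standard K-Means applied to TF-IDF text vectors with cosine similarity, plus the purity measure mentioned in the evaluation. It is not the thesis's algorithm. All function names (`tfidf_vectors`, `cosine`, `mean_vector`, `kmeans`, `purity`), the smoothed IDF formula, the naive "first k documents" seeding, and the toy data are assumptions made for illustration. Documents are assumed to be already segmented into non-empty word lists.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized, non-empty documents to unit-length sparse TF-IDF dicts."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption); it is always positive.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Sparse dot product; equals cosine similarity for unit-length inputs."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def mean_vector(vectors):
    """Sum the member vectors and rescale to unit length (cosine centroid)."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()}

def kmeans(vectors, k, max_iter=20):
    """Plain K-Means: assign to the most similar centroid, then update."""
    centroids = [dict(v) for v in vectors[:k]]  # naive seeding (assumption)
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels

def purity(labels, truth):
    """Fraction of documents that belong to their cluster's majority class."""
    clusters = {}
    for lab, cls in zip(labels, truth):
        clusters.setdefault(lab, Counter())[cls] += 1
    return sum(c.most_common(1)[0][1] for c in clusters.values()) / len(labels)

# Toy, hand-made example data (not from the thesis).
docs = [
    ["cluster", "text", "kmeans"],
    ["image", "pixel", "filter"],
    ["text", "cluster", "document"],
    ["pixel", "image", "edge"],
]
truth = ["nlp", "vision", "nlp", "vision"]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels, purity(labels, truth))
```

Because K-Means is sensitive to its initial centroids, the seeding step is where an improved variant would most plausibly differ from this sketch; how DK-Means actually does so is not stated in the abstract.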

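  • 【Online publication contributor】 Northeastern University (东北大学)
  • 【Online publication year/issue】 2012, Issue 03
  • 【Classification number】 TP391.1
  • 【Download count】 98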