
Research and Implementation of a k-means-Based Chinese Text Clustering Algorithm

【Author】 Zhang Rui (张睿)

【Advisor】 Liu Xiaoxia (刘晓霞)

【Author Information】 Northwest University (西北大学), Computer Software and Theory, 2009, Master's

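【Summary】 The k-means algorithm, widely applied in machine learning, data mining, and related fields, has also been widely used in text clustering because of its low time complexity. This thesis studies the techniques and algorithms related to text clustering. To address the high dimensionality and sparsity of text data, it improves the feature selection method used in text clustering as well as algorithms related to k-means, and on that basis designs and implements a prototype Chinese text clustering system. The main work is as follows: 1) When selecting features for clustering, the lack of class information makes it difficult to choose the feature terms with the greatest class-discriminating power. Building on two feature selection methods, document frequency and term contribution, a greedy algorithm is used to select features incrementally. Experiments show that the improved algorithm filters out more feature terms while preserving clustering quality. 2) The high dimensionality and sparsity of text data make the similarity between text objects hard to measure, so initial cluster centers chosen for k-means according to inter-document similarity may not represent the whole document collection well. To address this weakness, an improved method for selecting initial cluster centers is proposed for the k-means initialization problem. Experiments show that the initial centers selected by the improved method are well dispersed and representative. 3) To improve the quality of the resulting clusters, the bisecting k-means algorithm is improved by introducing the concept of neighbors from shared nearest neighbor similarity. Experimental results show that its clustering quality improves to some extent over the original algorithm. Based on the above work, a k-means-based prototype Chinese text clustering system was implemented, and the algorithms in the system were evaluated and compared through experiments.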
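As a rough illustration of the two base scores named in point 1), the sketch below computes document frequency and term contribution for tokenized documents. It assumes the commonly used definition of term contribution, TC(t) = Σ_{i≠j} w(t, d_i)·w(t, d_j) with L2-normalized tf-idf weights. The function names are hypothetical, and the thesis's greedy incremental selection step is not described in this record, so it is not reproduced here.

```python
from collections import Counter
import math


def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def tfidf_vectors(docs):
    """Build L2-normalized tf-idf vectors as {term: weight} dicts (assumed weighting)."""
    df = document_frequency(docs)
    n = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def term_contribution(vectors):
    """TC(t) = sum over ordered pairs i != j of w(t, d_i) * w(t, d_j).

    Uses the identity (sum of w)^2 - (sum of w^2), which avoids the
    explicit loop over document pairs.
    """
    total = Counter()
    squares = Counter()
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
            squares[t] += w * w
    return {t: total[t] * total[t] - squares[t] for t in total}


# Toy corpus of already-segmented documents (hypothetical data).
docs = [
    ["cluster", "text", "algorithm"],
    ["cluster", "text", "feature"],
    ["cluster", "image"],
]
df = document_frequency(docs)
tc = term_contribution(tfidf_vectors(docs))
```

Under this weighting, a term that occurs in every document gets an idf of zero, and a term that occurs in only one document has no document pair to contribute to, so both score zero on term contribution. Only terms shared by some but not all documents score above zero.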

【Abstract】 As a widely used algorithm in machine learning and data mining, k-means is also used in document clustering because of its low time complexity. This thesis focuses on how to improve the performance of document clustering algorithms. Building on existing research, it proposes improved k-means algorithms and a new feature selection method, and on that basis designs and implements a Chinese document clustering system. The work accomplished in this thesis is as follows: 1) It is hard for the unsupervised feature selection methods used in clustering to select features because class label information is lacking. Based on document frequency and term contribution, a greedy algorithm is introduced to select features incrementally. Experiments show that the proposed method can remove more features than traditional methods without degrading clustering quality. 2) To improve the clustering quality of k-means, well-separated initial centroids should be selected. Initial centroids are usually hard to select because of the high dimensionality and sparseness of document data. A new method for selecting initial centroids is proposed. Experiments show that the centroids selected by the proposed method are well separated and highly representative. 3) To improve the cluster quality of bisecting k-means, the concept of neighbors used in shared nearest neighbor similarity is introduced. Experiments show that the improved algorithm performs better than the original one. A document clustering system is designed and implemented using the algorithms mentioned above, and each algorithm in the system is compared and evaluated through experiments.
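To make the goal in point 2) concrete, the sketch below picks well-separated initial centroids with a farthest-first traversal under cosine similarity. This is a common stand-in heuristic, not the method proposed in the thesis, which this record does not describe. The function names and the sparse `{term: weight}` vector format are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def farthest_first_seeds(vectors, k, first=0):
    """Return k document indices to use as initial centroids.

    Each new seed is the document whose highest similarity to any
    already-chosen seed is lowest, so the seeds end up spread apart.
    """
    if not 0 < k <= len(vectors):
        raise ValueError("k must be between 1 and the number of documents")
    seeds = [first]
    # max_sim[i] = highest similarity between document i and any chosen seed
    max_sim = [cosine_similarity(v, vectors[first]) for v in vectors]
    while len(seeds) < k:
        candidates = [i for i in range(len(vectors)) if i not in seeds]
        nxt = min(candidates, key=lambda i: max_sim[i])
        seeds.append(nxt)
        for i, v in enumerate(vectors):
            max_sim[i] = max(max_sim[i], cosine_similarity(v, vectors[nxt]))
    return seeds


# Toy sparse vectors (hypothetical data): documents 0 and 1 are near-duplicates.
vectors = [
    {"a": 1.0},
    {"a": 1.0, "b": 0.1},
    {"b": 1.0},
    {"c": 1.0},
]
seeds = farthest_first_seeds(vectors, k=3)
```

On sparse, high-dimensional text vectors many document pairs have zero similarity, so ties are frequent. This sketch breaks them by taking the lowest index, which is one reason a more careful selection rule, such as the one the thesis proposes, can matter.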

  • 【Online Publication Contributor】 Northwest University
  • 【Online Publication Issue】 2009, Issue 08