节点文献
文本聚类集成关键技术研究
Research of Key Technologies on Document Cluster Ensemble
【作者】 徐森;
【导师】 顾国昌;
【作者基本信息】 哈尔滨工程大学 , 计算机应用技术, 2010, 博士
【摘要】 聚类分析是数据挖掘、模式识别等方向的重要研究内容之一,已被广泛用于数据压缩、信息检索、语音识别、字符识别、图像分割和文本聚类等。另外,在生物学、地质学、地理学、市场营销和异常数据检测等方面也受到越来越多的关注。目前,已有上千种聚类算法,然而没有一种算法可以成功识别出具有不同大小、不同形状、不同密度甚至可能包含噪声的簇。文本数据具有高维、稀疏等特点,这使得许多聚类算法并不适用于文本聚类;另外,文本集规模的海量性对聚类算法的运行效率也提出了很高的要求。作为传统聚类算法的重要扩展,聚类集成技术具备了传统聚类算法所不具备的诸多优点。目前,聚类集成已经发展成为机器学习领域的研究热点之一。本文以文本聚类为应用背景,针对文本聚类集成中的关键问题进行了研究,取得的创新性研究成果包括:(1)鉴于谱聚类方法的诸多优点,本文将基于矩阵扰动理论和谱图理论的谱聚类算法引入到文本聚类集成问题中。针对谱聚类算法计算复杂度高的问题,本文基于代数变换,首先将大规模矩阵的特征值分解问题转化为等价的奇异值分解问题,并进一步转化为小规模矩阵的特征值分解问题。由此设计了两个不同的文本聚类集成谱算法SMSA和TMSA。实验结果表明:本文的代数变换方法是切实可行的,代数变换前后算法的运行时间大幅度减少,而且获得的结果非常接近;SMSA和TMSA比基于图划分的聚类集成算法更优越,是解决文本聚类集成问题行之有效的方法。(2)本文研究了谱聚类算法的关键思想,从求解“最佳”子空间出发,同时推导出文本和超边的低维嵌入,由此设计了两个基于子空间相似度的聚类集成算法SSICA和SSDCA,实验结果表明:SSICA和SSDCA都获得了比基于图划分的聚类集成算法更优越的结果;SSICA的聚类质量略高于SSDCA。本文进一步泛化SSICA,设计出基于低维嵌入的文本聚类集成方法。该方法首先通过不同的谱聚类算法获得了超边的低维嵌入;随后通过映射的复合间接获得了文本的低维嵌入;最后根据文本在低维空间下的坐标使用简单K均值算法聚类。实验结果表明,该方法比其它常见的基于图划分的聚类集成方法优越,可以有效解决文本聚类集成问题。(3)本文将非负矩阵分解(NMF)引入到文本聚类集成问题中,设计了BNMF算法;由于NMF算法收敛速度较慢、易于收敛到较差的局部最优解,本文使用K均值初始化NMF,设计出NMFK算法;另外,针对K均值算法随机初始化所带来的聚类结果不稳定问题,本文使用最小最大原则确定K均值算法的初值,设计出NMFKMMP算法。实验结果表明:使用K均值算法初始化NMF是有效的,NMFK获得了比BNMF算法更加优越、稳定的结果,且运行效率也比BNMF高出许多;NMFKMMP算法可以有效解决文本聚类集成问题,NMFKMMP算法运行高效,并且获得了比其它常见的聚类集成算法更加优越的结果。(4)超球K均值算法不能有效识别非超球状的簇,因此易于产生精度较低的文本聚类集成成员。为了进一步提高文本聚类集成算法的聚类质量,本文在集成成员生成阶段引入了CHAMELEON算法的关键思想——“分裂—合并”(DM)策略。首先在聚类成员生成阶段运行使用DM策略的SKM算法r次,每次生成较多的文本子簇,并根据子簇的相似性使用Ward算法合并这些子簇,得到r个聚类成员,随后在聚类集成阶段采用本文设计的聚类集成算法进行集成。实验结果显示,除了基于图划分的聚类集成算法外,基于层次聚类方法的4个聚类集成算法以及本文设计的基于谱聚类方法、基于低维嵌入方法和基于非负矩阵分解方法的多个文本聚类集成算法在使用DM策略后获得的平均规范化互信息(NMI)都有不同程度的提高,这表明DM策略可以有效提高聚类集成算法的聚类质量。
【Abstract】 As one of the most important research topics in data mining and pattern recognition, clustering analysis has been widely used in areas such as data compressing, information retrieval, speech recognition, characters recognition, image segmentation, document clustering, etc. Besides, it has been attracted more and more attention in biology, geology, geography, marketing and abnormal data detection. Thousands of clustering algorithms exist, yet no one can successfully recognize clusters with different sizes, different shapes, and different densities or even with noises. Many of them are not applicable for document clustering due to the high dimensionality and sparseness of documents; furthermore, the magnanimity of document sets imposes high efficiency on clustering algorithms. As an important extension to convenient clustering algorithms, cluster ensemble technique has become a hotspot in machine learning area because it has many advantages that convenient clustering algorithms do not have. Taking document clustering as application background, this dissertation studies the key problems in document cluster ensemble, and obtains the following innovative contributions:(1) Spectral clustering algorithms which are based on matrix perturbation theory and spectral graph theory, respectively, are brought into document cluster ensemble problem. To make the algorithms extensible to large scale applications, the eigenvalue decomposition problem of large scaled matrixes is avoided by solving the eigenvalue decomposition of small matrixes, and thus computational complexity of the algorithms is effectively reduced. Experiments on real-world document sets show that (a) the algebraic transformation is feasible; (b) both of the proposed ensemble algorithms based on spectral cluatering algorithms outperform other common cluster ensemble techniques based on graph partitioning, and they can effectively solve document cluster ensemble problem.(2) The key idea of spectral clustering algorithms is introduced into solving document cluster ensemble problem. Beginning with seeking the“best”subspace, the low dimensional embeddings of documents and hyperedges are attained simultaneously, whereupon two simple and fast algorithms, SSICA and SSDCA, are proposed. Experiments on real-world document sets show that the proposed algorithms outperform other common methods and SSICA is a little better than SSDCA. SSICA is further extended and a low dimensional embedding method is proposed. Firstly, spectral clustering algorithms are performed to obtain the low dimensional embeddings of hyperedges. Then the low dimensional embeddings of documents are attained indirectly by composition of mappings and finally K-means algorithm is performed to cluster the documents according to their coordinates in the low dimensional space. Experiments on real-world document sets show that the proposed method outperforms other common cluster ensemble technologies based on graph partitioning algorithms and can also effectively solve document cluster ensemble problem.(3) Non-negative Matrix Factorization (NMF) is brought forth into document cluster ensemble problem and BNMF is proposed. Since NMF converges slowly and tends to converge to poor solution, NMFK is proposed which use K-Means to initialize NMF. Furthermore, in order to solve the unstable problem due to the random initialization of K-Means, NMFKMMP is proposed which uses minimum and maximum principle to attain the initial centroids. Experimental results show that: (a) using K-Means to initialize NMF is effective because NMFK can get better and more stable results than BNMF, and it is more efficient than BNMF; (b) NMFKMMP can effectively solve document cluster ensemble problem because it outperforms other cluster ensemble technologies and is very efficient.(4) To further boost the performance of cluster ensemble algorithm, the key idea of CHAMELEON, Divide and Merge (DM) strategy, is introduced to reinforce the ability of spherical K-means algorithm to discover clusters with irregular shapes. In ensemble member generation phase, SKM was first performed to obtain multiple document sub-clusters for r times, each time using Ward, an agglomerative hierarchical method, to merge these sub-clusters according to their similarity, and attained an ensemble of r clusterings. Then the cluster ensemble algorithms are performed to ensemble the r clusterings. Experiments on real-world document sets show that DM strategy increased with different degrees the normalized mutual information (NMI) scores of the cluster ensemble algorithms based on hierarchical clustering method, spectral method, low dimensional embedding method and non-negative matrix factorization method except for the graph partitioning-based method. These results prove that DM strategy can effectively improve the performance of document cluster ensemble algorithms.