半监督聚类集成理论与技术研究

Research on Theory and Technology of Semi-Supervised Clustering Ensemble

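【Author】 陈大海 (Chen Dahai)

【Advisor】 杨燕 (Yang Yan)

【Author information】 Southwest Jiaotong University (西南交通大学), Computer Application Technology, 2013, Master's thesis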

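【Abstract, translated from the Chinese】 Cluster analysis is an important technique in data mining and machine learning. It is widely applied in many fields, particularly in processing and analyzing big data. Given a similarity measure, clustering partitions the data objects into clusters so that similarity is maximal within a cluster and minimal between clusters. In practice, unsupervised clustering methods cannot exploit even a small amount of prior knowledge, and a single clustering algorithm struggles with datasets whose structure and distribution are complex and varied. Semi-supervised clustering ensemble techniques address both shortcomings: they bring semi-supervised learning and ensemble learning into cluster analysis and can effectively improve clustering performance. However, because research on semi-supervised clustering ensembles has only recently begun, much of its theoretical underpinning is still immature, and theoretical study can provide strong support for the development of the technique. Semi-supervised clustering ensembles use prior knowledge to guide the clustering process and improve its performance, and they use the idea of ensemble learning to combine multiple base clustering results into a better partition. Inspired by research on semi-supervised learning and clustering ensembles, and drawing on probability and statistics, this thesis gives a mathematical analysis and discussion of the theory of semi-supervised clustering ensembles. Under assumptions about the ensemble model and its parameters, it proves and analyzes convergence. It introduces the concept of a robust radius to express the range of the degree of robustness, and uses it to analyze the robustness of semi-supervised clustering ensembles. The thesis then proposes a label-unification method based on a contingency matrix, which aligns the class labels of the base clusterings (partitions), and adds prior knowledge in the form of pairwise constraints to a semi-supervised clustering ensemble model based on majority voting. Experimental results show that prior knowledge can improve the performance of both the base clusterings and the semi-supervised clustering ensemble, that the ensemble exhibits convergence and robustness, and that the improved majority-voting method achieves good clustering results. Semi-supervised clustering ensemble techniques can effectively use prior knowledge to guide both the clustering and the ensemble process, and by fusing base partitions that differ from one another they can effectively improve clustering performance. Using statistical arguments, this thesis proves that the semi-supervised clustering ensemble method converges, analyzes its robustness, proposes a robustness measure, and proposes a semi-supervised clustering ensemble model based on majority voting. Experimental results show that the ensemble result converges as the number of diverse base partitions increases and that its robustness is also fairly good. Once prior knowledge is fully used, the majority-voting semi-supervised clustering ensemble method can effectively improve clustering performance.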

【Abstract】 Cluster analysis is an important technique in data mining and machine learning. It is widely used in many fields, especially in processing and analyzing big data. Given a similarity measure, clustering divides the data objects into several clusters so as to maximize the similarity between objects in the same cluster and minimize the similarity between objects in different clusters. In practice, unsupervised clustering algorithms run without considering any prior knowledge, and a single clustering algorithm can rarely handle datasets whose structure or distribution is complex. A semi-supervised clustering ensemble makes up for this deficiency by applying semi-supervised learning and ensemble learning to cluster analysis, and it can effectively improve clustering performance. However, research on semi-supervised clustering ensembles is only just emerging, and there are few theoretical studies. Theoretical study can provide a solid foundation for the development of semi-supervised clustering ensembles.

Semi-supervised clustering ensemble technology uses prior knowledge to guide the clustering process, which can improve clustering performance, and at the same time uses ensemble learning to combine the base clusterings into a better result. Inspired by research on semi-supervised learning and clustering ensembles, and drawing on probability and statistics, this thesis presents a mathematical analysis and discussion of semi-supervised clustering ensembles. Under some assumptions, it gives a mathematical proof and analysis of the convergence of semi-supervised clustering ensembles. The author proposes the concept of a robust radius to measure the degree of robustness and uses it to analyze the robustness of semi-supervised clustering ensembles.

This thesis proposes a new relabeling approach, based on a contingency matrix, to unify the labels of the base clusterings (partitions). Prior knowledge, in the form of pairwise constraints, is then added to a semi-supervised clustering ensemble model based on majority voting. The experimental results show that prior knowledge can improve the performance of both the base clusterings and the semi-supervised clustering ensemble, that the semi-supervised clustering ensemble exhibits convergence and robustness, and that the approach obtains a better clustering result.

Semi-supervised clustering ensemble technology can effectively use prior knowledge to guide the clustering and ensemble process, and it improves clustering performance by aggregating multiple diverse partitions. This thesis proves the convergence of semi-supervised clustering ensembles using statistical techniques and presents a robustness measure to analyze their robustness. A new semi-supervised clustering ensemble model based on majority voting is then proposed. The experimental results show that, as the number of diverse base partitions increases, the semi-supervised clustering ensemble converges and remains robust. With prior knowledge, the semi-supervised clustering ensemble method based on majority voting achieves better performance than other clustering ensemble algorithms.
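
The abstract names two steps: unifying base-clustering labels through a contingency matrix, then combining the aligned partitions by majority voting. The following is a minimal Python sketch of that general idea, not the thesis's actual algorithm. The function names (`contingency`, `relabel`, `majority_vote`), the choice of the first partition as the reference, and the greedy largest-overlap matching are all assumptions made for illustration. The pairwise-constraint step is left out because the abstract does not say how the constraints enter the model.

```python
from collections import Counter


def contingency(reference, partition, k):
    """Count co-occurrences: table[i][j] = points with label i in
    `partition` and label j in `reference`."""
    table = [[0] * k for _ in range(k)]
    for ref_label, own_label in zip(reference, partition):
        table[own_label][ref_label] += 1
    return table


def relabel(reference, partition, k):
    """Rename the clusters of `partition` so they agree with `reference`.

    Assumption: a greedy one-to-one matching that takes the largest
    overlaps first; the thesis may use a different matching rule.
    """
    table = contingency(reference, partition, k)
    cells = sorted(
        ((table[i][j], i, j) for i in range(k) for j in range(k)),
        key=lambda cell: -cell[0],
    )
    mapping, used = {}, set()
    for _, own, ref in cells:
        if own not in mapping and ref not in used:
            mapping[own] = ref
            used.add(ref)
    return [mapping[label] for label in partition]


def majority_vote(partitions, k):
    """Align every partition to the first one, then vote per data point."""
    reference = partitions[0]
    aligned = [reference] + [relabel(reference, p, k) for p in partitions[1:]]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]


base_partitions = [
    [0, 0, 0, 1, 1, 1],  # reference partition
    [1, 1, 1, 0, 0, 0],  # same grouping, labels swapped
    [0, 0, 1, 1, 1, 1],  # disagrees on the third point
]
consensus = majority_vote(base_partitions, k=2)
```

Alignment matters because cluster labels are arbitrary: the second partition above describes the same grouping as the first, but its raw labels would cancel the first partition's votes if they were counted without relabeling.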
