节点文献

基于成对约束的半监督聚类算法研究及其并行化实现

Semi-Supervised Clustering Algorithm Based on Pairwise Constraints and Its Parallel Implementation

【作者】 林超

【导师】 杨燕;

【作者基本信息】 西南交通大学 , 计算机应用技术, 2013, 硕士

【摘要】 作为数据挖掘领域中的一种重要方法,聚类分析能够发现数据对象自然的分布结构。它根据数据对象之间的相似性,把数据对象分割成簇,并保证同一簇内中数据的相似性尽可能大,而不同簇间数据的相似性尽可能小。从机器学习的角度来看,聚类分析是一种无监督的学习方法,它按照一定的优化准则对数据进行分割,对数据的分析不需要知道其相关的背景知识。但是,现实生活中我们对数据的信息并不是一无所知,并且我们发现通过这些少量的已知信息能够找到数据对象标识或相互之间的约束信息。半监督聚类就是在传统的无监督聚类算法中引入先验知识来指导聚类过程,提高聚类结果精度。本文选择引入成对约束作为先验知识来协助指导聚类过程,分别建立了Must-Link和Cannot-Link约束组,用以描述两个样本数据间的关系。其中,Must-Link代表两个样本数据必须被分配到同一划分,而Cannot-Link则代表两个样本数据必须被分配到不同的划分。详细介绍了基于成对约束的半监督聚类算法Cop-Kmeans,对算法比较常见的约束违反的问题,提出了全新的改进方法,在解决约束违反的同时,算法的运行时间效率也优于传统的改进方案。此外,针对成对约束自身特征可能给聚类性能带来的不良影响,进一步提出了相应的改进方案,能够最大限度的削弱这种不良影响,从而能够在一定程度上改善聚类结果精度。考虑到当聚类对象是一个大数据集或者高维数据类型时,传统的单机串行聚类算法无论是在内存或者运算能力都无法满足实际需求。本文选择运用“云计算”思想,采用并行处理方式处理大规模的数据集。我们利用MapReduce计算模型对改进的半监督聚类算法进行并行化实现,并在Hadoop搭建的并行处理平台上处理大数据集。实验结果表明,采用并行计算方式能够显著提高聚类效率。

【Abstract】 As an important method in the field of data mining, cluster analysis is able to find the natural distribution structure of the data objects. It is a process that divides objects into the similar class according to their attribute. The goal of the cluster is that the similarity of objects from the same group is larger than the similarity of objects from the different group. From the perspective of machine learning, clustering analysis is an unsupervised learning method, and we don’t need any background knowledge when analyze on data objects. However, we can always get some information of the data objects to be analyzed, and we find that a small amount of known information can help find the data object identifier or constraint information between two instances. By adding prior knowledge to the traditional unsupervised clustering algorithm and guide the whole clustering process, then we get semi-supervised clustering algorithm with a high accuracy of clustering result.In this thesis, we select pairwise constraints to help guide the clustering process. Generally, pairwise constraints contain two parts:Must-link and Cannot-link, they describe the relationship between two samples of data. Wherein, Must-link represents two samples must be assigned to the same cluster, while Cannot-link represents two samples of data must be assigned to the different cluster. This thesis also introduces the semi-supervised clustering algorithm Cop-Kmeans in details, which is based on pairwise constraints. We put forward a new and improved method to solve the constraint violation exists in the Cop-Kmeans, the efficiency of the algorithm is also better than the traditional improvement program. In addition, we find the pairwise constraints may have an adverse effect on clustering performance, so we further propose a corresponding improved program. It is possible to weaken such adverse effects, and improve the accuracy of the clustering result to a certain extent.Since the traditional serial clustering algorithm can not meet the requirements either in memory or computing speed when clustering object is a type of large data sets or high-dimensional data, and inspired the idea of "cloud computing", the thesis deals with large-scale data sets in a parallel way. We use Hadoop to set up a parallel processing platform, and parallelize the proposed algorithm according to the MapReduce computing model, so that it can efficiently handle large data sets. Experiments show that parallel computing model can significantly improve the efficiency of clustering.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络