Study of Clustering Algorithms and Their Validity

【Author】 Liu Mingshu (刘明术)

【Advisor】 Fang Hongbin (方宏彬)

【Author information】 Anhui University, Computational Mathematics, 2012, Master's thesis

【Abstract】 Clustering algorithms are an important research area in data mining, and clustering validity refers to indices, grounded in clustering theory, that assess the quality of a clustering result. The main approaches to validating a clustering are statistical hypothesis tests based on internal or external criteria, validity of a clustering hierarchy, validity of individual clusters, the Dunn and Dunn-like indices, the Davies-Bouldin (DB) and DB-like indices, and the Gap statistic. Common clustering algorithms include hierarchical, grid-based, density-based, and partition-based algorithms, among others. These algorithms usually measure similarity with the Euclidean distance. However, the Euclidean distance treats differences in different attributes of a sample as equally important and is easily disturbed by correlations between variables. This affects not only the speed and quality of clustering but also the performance of clustering validity indices, so the Euclidean distance sometimes fails to meet practical requirements. The Mahalanobis distance between two points, by contrast, has several advantages: it is not affected by the scale of the variables, so it is independent of the units in which the original data were measured; it takes the same value whether it is computed from standardized data or from centered data; and it removes the interference caused by correlations between variables. This thesis discusses the limitations of hierarchical clustering algorithms and of the Euclidean distance. Taking full account of the geometric structure of the data and the attributes of individual samples, it proposes a new attribute similarity measure and a new clustering validity function based on the Mahalanobis distance, and it improves the hierarchical clustering algorithm that uses the Euclidean distance. Experiments show that the improved algorithm has certain advantages.
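
The invariance properties of the Mahalanobis distance mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's similarity measure or validity function, which the abstract does not spell out. It only assumes the standard definition d(a, b) = sqrt((a - b)^T S^{-1} (a - b)), with S the sample covariance matrix, restricted to two attributes so that plain Python suffices. The data set and all function names are hypothetical.

```python
import math

def mean_and_cov(points):
    """Per-attribute means and sample covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(a, b, cov):
    """Standard Mahalanobis distance between 2-D points a and b for covariance cov."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be nonzero (attributes not perfectly correlated)
    dx, dy = a[0] - b[0], a[1] - b[1]
    # Quadratic form with the explicit inverse of [[sxx, sxy], [sxy, syy]].
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

def euclidean(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def standardize(points):
    """Center each attribute and divide it by its sample standard deviation."""
    (mx, my), (sxx, _, syy) = mean_and_cov(points)
    sx, sy = math.sqrt(sxx), math.sqrt(syy)
    return [((x - mx) / sx, (y - my) / sy) for x, y in points]

# Hypothetical sample with two correlated attributes.
data = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (4.0, 6.0), (5.0, 5.5)]

# Same sample with the first attribute in different units (for example m -> cm).
rescaled = [(100.0 * x, y) for x, y in data]

# Same sample after centering only, and after full standardization.
(mx, my), _ = mean_and_cov(data)
centered = [(x - mx, y - my) for x, y in data]
standardized = standardize(data)

i, j = 0, 3  # compare the same pair of samples in every version of the data

def pair_distance(points):
    _, cov = mean_and_cov(points)
    return mahalanobis(points[i], points[j], cov)

d_orig = pair_distance(data)
d_rescaled = pair_distance(rescaled)
d_centered = pair_distance(centered)
d_standardized = pair_distance(standardized)

e_orig = euclidean(data[i], data[j])
e_rescaled = euclidean(rescaled[i], rescaled[j])

print("Mahalanobis:", d_orig, d_rescaled, d_centered, d_standardized)
print("Euclidean:  ", e_orig, e_rescaled)
```

In this sketch the four Mahalanobis values agree up to floating-point rounding, while the Euclidean distance changes as soon as one attribute is expressed in different units. For more than two attributes, a linear algebra library would normally be used to invert the covariance matrix.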

  • 【Online publication contributor】 Anhui University
  • 【Online publication year and issue】 2012, Issue 10