节点文献

针对包含异常值数据的优化K-MEANS聚类算法

Optimization of K-MEANS Clustering Algorithm for Data with Outliers

【作者】 焦昂

【导师】 张军英;

【作者基本信息】 西安电子科技大学 , 计算机应用技术, 2009, 硕士

【摘要】 K-MEANS聚类算法是一种广为应用的简捷的迭代算法,其应用价值和重要性受到很多领域的认可。传统的K-MEANS算法以数据点之间的欧氏距离为测度,误差平方和为目标函数。K-MEANS算法对异常值的存在很敏感,因为“均值”本身就不是一个健壮的统计量。而异常值是这样一种极端的观测结果,它们在数值上远离样本数据集的均值,其存在使得所有基于均值和方差的统计测试一定程度地失真。然而,大样本中总会有一定量的异常值。因此K-MEANS聚类算法的效果不可避免地受到异常值的影响。本文就K-MEANS聚类算法的原理进行了研究,并提出了一个基于异常值删除的K-MEANS优化算法。该算法主要的特点就是利用了K-MEANS聚类算法原理上的缺陷,即会陷入局部极小的特点,在基于聚类的异常值检测的思想下,以聚类的方式寻找异常值并将其删除。算法引入了熵和平衡的概念,作为算法终止的一种条件。为了防止K-MEANS算法陷入某个局部极小而应用了一种类似刺激的机制,即利用类似欠阻尼曲线的变化形式来控制聚类数目的改变,以使K-MEANS算法在陷入某个局部极小而无法继续寻找异常值的时候能够跳出该局部极小,在不断的聚类过程中,能够继续寻找并删除异常值聚类,从而减小K-MEANS聚类算法受异常值的影响,有效地提高了算法寻找聚类中心的能力和聚类的准确率。

【Abstract】 The K-MEANS clustering algorithm is a widely used simple iterative method to partition a given dataset into a user-specified number of clusters, k. Its practical value and importance have been acknowledged across different disciplines. The traditional K-MEANS take Euclidean distances of each observation as the measurement, and error square sum the objective function. An outlier is such an extreme observation that is numerically distant from the mean of data sample, which causes all the statistical tests based on mean and variance to distort to some extent. However, a small number of outliers not due to any anomalous condition are to be expected in large samples. Thus K-MEANS inevitably is impacted by the existence of outliers.This paper researches on the algorithm and measurement of K-MEANS, then proposes an optimization of K-MEANS algorithm based on outlier deletion. The main point is that the defect that K-MEANS would fall into a suboptimization is made advantage in our algorithm. Under the strategy of cluster-based outlier detection, outliers are searched and deleted in clusters. The notion of entropy and balance are invited as a condition to end the clustering process. To avoid K-MEANS from falling into certain suboptimization and ceasing searching for outliers, a mechanism of stimulation is introduced. The number of clusters, k, is changing during the deletion process, following a certain curve. The aim of changing k is to kick iterative process out of the suboptimization which as a result would help continue the outlier deletion process. Thus outliers would be searched and deleted as much as possible. The ability to find cluster centers and cluster correctly is raised effectively after decreasing the influence outliers have on K-MEANS algorithm.

  • 【分类号】TP311.13
  • 【被引频次】1
  • 【下载频次】194
节点文献中: 

本文链接的文献网络图示:

本文的引文网络