节点文献

基于蚁群算法的离群点挖掘算法研究

Research on Outlier Detection Algorithm Based on Ant Colony Algorithm

【作者】 李红娜

【导师】 任家东;

【作者基本信息】 燕山大学 , 计算机应用技术, 2010, 硕士

【摘要】 离群点挖掘随着数据挖掘的发展引起了广泛关注。通过对国内外离群点挖掘算法的研究情况分析可知,以往的离群点挖掘算法还存在诸多问题,例如用户定义的阈值往往直接影响着挖掘的结果;考查多变量之间的相似性来挖掘时序离群点的算法仍较少,或精确度较低。针对这些问题,本文主要研究了基于蚁群算法的离群点挖掘方法。首先,提出了一种在对蚁群构图进行切割的基础上挖掘离群点的算法。该算法在第一阶段对传统的蚁群算法进行改进,将不同属性数据之间的距离和分布情况纳入转移概率的计算之中,从而构建最优的图像。然后在一定的图像切割准则下对图像进行切割,最后通过计算各个簇,即切割图像后形成的各子图之间的差异以及同一簇中数据点之间的差异来找到top n离群点。其次,提出了一种基于改进的蚁群k-means聚类算法的多变量时序离群点挖掘算法。该算法把蚁群算法特有的信息素和转移概率引入对数据聚类的过程中,通过计算类内距离和类间距离找到符合聚类标准的最好聚类结果,然后通过查看各数据点在不同簇中的时刻点分布情况,以邻居相似性为标准计算各点的离群系数,从而实现时序离群点的挖掘。最后,在真实和合成数据集上对提出的两种算法进行了验证。实验结果表明,提出的算法在对离群点的检测精度上要明显优于其他同类算法,实现了预期的研究目标。

【Abstract】 Outlier detection has obtained the widespread concern with the increasing application of data mining. We can know that there are still some problems in the area of outlier detection by analyzing the existed algorithms of domestic and international. The user-defined thresholds often affect the result of outlier detection to a large extent and it is such a difficult task to define them for users who don’t have previous knowledge. Few algorithms mine outliers on the basis of similarity of neighbors when mining outliers on multivariate time series data, but obviously this is a very promising area. There is still another problem: the precision is still low in outlier detection. To solve these problems we mainly study the methods of outlier detection based on ant colony algorithm.Firstly, a method of graph-cut based outlier detection using ant colony algorithm is proposed. In the first phase we make some improvements to the traditional ant colony algorithm, for example, to calculate the more accurate transition probability, we take the distance on categorical and numerical attributes as well as the local distribution situation into consideration; in this way a better graph can be constructed in limited time. Then we cut on the graph under a criterion and every subgraph can be treated as a cluster consequently. At the last step we can get top n outliers by comparing the difference of clusters and the difference of points that lie in the same cluster.Secondly, a method of outlier detection on multivariate time series data based on improved ant colony algorithm and k-means clustering algorithm is proposed. In the process of classifying the time series data we introduce the concepts of pheromone and transition probability of the ant colony algorithm into the k-means clustering algorithm, and then try to find the best clustering result which is determined by the distance of inner-clusters and inter-clusters. At the last step we calculate the outlier score of every data point by checking the distribution of the time points that lie in clusters under the criterion of neighbor-similarity.In the end we test the two algorithms proposed in this paper on both of real-world datasets and synthetic datasets. The experimental results of the proposed algorithms show the superiority on the precision of outlier detection than other existed algorithms. And the expected goal is achieved.

  • 【网络出版投稿人】 燕山大学
  • 【网络出版年期】2012年 02期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络