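Research on Classification Algorithms for Imbalanced Mixed-Type Data and Their Applications

【Author】 Chen Yuzhou (陈宇宙)

【Supervisor】 Liao Zhifang (廖志芳)

【Author information】 Central South University (中南大学), Computer Application Technology, 2008, Master's thesis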

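【Summary】 Classifying imbalanced mixed-type data is a very common task in real-world applications; such data are unevenly distributed across classes and described by attributes of differing types. Traditional classification methods are not very effective on this type of data, and when the minority-class samples are important enough, misclassifying them can cause substantial losses. Methods for handling imbalanced mixed data have therefore become a focus of data mining research in China and abroad. This thesis builds on traditional classification methods and improves them so that they can handle imbalanced mixed data. Analysis shows that the k-Nearest Neighbours by Counting algorithm (CwkNN) classifies mixed-type data effectively but performs poorly on imbalanced data. Building on CwkNN and taking the imbalance of the data into account, the thesis proposes three improved classification methods: (1) Global density classification algorithm: because CwkNN cannot handle imbalanced data, a global density is introduced to rebalance the influence of the data on classification. Experiments show that this raises the classification accuracy of minority-class samples but lowers that of majority-class samples. (2) K-local density classification algorithm: because the global density algorithm lowers the classification accuracy of majority-class samples, a K-local density is introduced so that the accuracy of minority-class samples improves without lowering that of majority-class samples. Experiments show that this method effectively improves classification accuracy on imbalanced data. (3) Density-based boundary point detection and classification algorithm: for the boundary points in the data, a density-based detection method is proposed, and the detected boundary points are classified with three boundary-point classification methods. Experiments show that these methods can correctly classify imbalanced data that contain boundary points.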

【Abstract】 The classification of imbalanced mixed data is very common in the real world. Such data are unevenly distributed and have attributes of diverse types. Traditional classification learning methods are not very effective on this type of data, and if the minority-class samples are sufficiently important, misclassification may lead to large losses. Methods for processing imbalanced mixed data have therefore become one of the focal points of current domestic and international data mining research. The main work of this thesis takes traditional classification methods as its basis and improves them to handle imbalanced mixed data. Analysis of the k-Nearest Neighbours by Counting algorithm (CwkNN) shows that it is effective for classifying mixed data but unsatisfactory on imbalanced data. The thesis therefore proposes three improved classification methods that combine CwkNN with the characteristics of imbalanced data, as follows: (1) The global density classification algorithm: since CwkNN cannot handle imbalanced data, a global density is introduced to rebalance the influence of the data on classification. Experiments show that the classification accuracy of minority-class samples increases while that of majority-class samples decreases. (2) The K-local density classification algorithm: since the global density classification algorithm reduces the classification accuracy of majority-class samples, a K-local density is introduced to ensure that the accuracy of minority-class samples improves without reducing that of majority-class samples. Experiments show that this method effectively increases classification accuracy on imbalanced data. (3) The density-based boundary point detection and classification algorithms: for the boundary points in the data, the thesis proposes a density-based boundary point detection method and applies three boundary-point classification methods to the detected points. Experiments show that these methods can correctly classify imbalanced data that contain boundary points.
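The abstract describes the global density idea only in outline and gives no formulas. As a rough illustration, the Python sketch below assumes that a class's "global density" is simply its share of the training set and that each class's vote among the k nearest neighbours is divided by that share. Neither assumption comes from the thesis. The distance is a simple Gower-style measure for mixed numeric and categorical attributes, swapped in as a stand-in for the neighbourhood counting used by CwkNN. The names `mixed_distance`, `classify`, `SAMPLES`, and `LABELS` are hypothetical.

```python
from collections import Counter


def mixed_distance(a, b, numeric_ranges):
    """Gower-style distance: range-scaled gap for numeric attributes,
    0/1 mismatch for categorical ones (a stand-in, not CwkNN itself)."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            total += abs(x - y) / span if span else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def classify(query, samples, labels, k=3, use_density=True):
    """k-NN vote; optionally divide each class's vote by its class share."""
    # Treat int/float columns as numeric and everything else as categorical.
    numeric = [i for i, v in enumerate(samples[0]) if isinstance(v, (int, float))]
    ranges = {
        i: max(s[i] for s in samples) - min(s[i] for s in samples) for i in numeric
    }
    order = sorted(
        range(len(samples)),
        key=lambda j: mixed_distance(query, samples[j], ranges),
    )
    votes = Counter(labels[j] for j in order[:k])
    if not use_density:
        return max(votes, key=votes.get)
    # ASSUMPTION: "global density" of a class = its fraction of the training set.
    class_counts = Counter(labels)
    n = len(labels)
    scores = {c: votes[c] / (class_counts[c] / n) for c in votes}
    return max(scores, key=scores.get)


# Toy imbalanced data: eight majority ("neg") and two minority ("pos") samples.
SAMPLES = [
    (1.0, "a"), (1.2, "a"), (1.4, "a"), (1.6, "a"),
    (1.8, "a"), (2.0, "a"), (2.2, "a"), (2.4, "a"),
    (3.0, "a"), (9.0, "b"),
]
LABELS = ["neg"] * 8 + ["pos"] * 2

plain = classify((2.8, "a"), SAMPLES, LABELS, k=3, use_density=False)
weighted = classify((2.8, "a"), SAMPLES, LABELS, k=3)
print(plain, weighted)
```

For the query `(2.8, "a")`, the three nearest neighbours are one minority and two majority samples, so the plain vote picks the majority class while the reweighted vote picks the minority class. The same reweighting can also flip borderline majority points, which is consistent with the trade-off the abstract reports for the global density method and which the K-local density method is said to address.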

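  • 【Online publication contributor】 Central South University (中南大学)
  • 【Online publication year and issue】 2009, Issue 01
  • 【Classification number】 TP301.6
  • 【Times cited】 4
  • 【Downloads】 151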