Research on Support Vector Machine Classification Method for Imbalanced Datasets

【Author】 Yang Zhiming (杨智明)

【Advisor】 Peng Xiyuan (彭喜元)

【Author information】 Harbin Institute of Technology, Instrument Science and Technology, 2009, PhD

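【Abstract (translated from the Chinese)】 The support vector machine (SVM) is a machine learning method built on statistical learning theory. Compared with traditional methods such as neural networks, it copes better with practical difficulties such as high dimensionality, nonlinearity, and local minima, and it has become a research focus in machine learning. SVM has a solid theoretical basis and performs well even when few samples are available, which makes it well suited to typical small-sample learning problems such as fault diagnosis. Applying SVM to fault diagnosis therefore has both theoretical significance and practical engineering value.

SVM usually gives good diagnostic results when the diagnosis dataset is balanced. In practice, however, fault data are hard to obtain, so diagnosis datasets are often severely imbalanced, and studies show that SVM classifies poorly when class sizes are imbalanced. This dissertation therefore studies the loss of SVM classification accuracy on imbalanced datasets. It focuses on two families of methods, those based on data preprocessing and those based on modifying the SVM algorithm, and applies them to analog circuit fault diagnosis. This resolves the accuracy loss caused by severely imbalanced training data in practice and widens the range of problems to which SVM can be applied. The main research work is as follows.

1. When the original SMOTE method generates new synthetic minority-class samples, it considers neither the true distribution of the minority samples nor the distribution of majority samples near them, so it is partly blind. To address this, the dissertation proposes an improved SMOTE method, adaptive SMOTE. Based on the distribution of samples within the dataset, the method adaptively adjusts the nearest-neighbor selection strategy of the original over-sampling method to control the quality of the synthetic samples. Simulation experiments show that preprocessing a dataset with this method effectively improves the classification performance of SVM.

2. Traditional sample-set pruning methods such as one-sided selection simply delete boundary samples from the sample set, which loses part of the classification information. To address this, the dissertation proposes a fuzzy sample-set pruning technique based on the K-nearest-neighbor method. To address the severe loss of classification information in traditional random undersampling, it proposes a guided undersampling technique based on unsupervised learning. Undersampling the majority class with these two methods effectively reduces the adverse effect of an imbalanced sample set on SVM classification.

3. On the algorithm side, the dissertation first analyzes in detail the underlying reason why SVM classifies minority-class samples poorly on imbalanced datasets, and on that basis proposes an improved SVM method, μSVM. The method introduces a new parameter μ that adjusts the distance measure in the classification decision function, so that the separating hyperplane shifts toward the majority class. This enlarges the decision region of the minority class and raises the classification accuracy on minority samples.

4. SVM rests on a nonlinear mapping of the samples into a high-dimensional feature space in which they become linearly separable. In practice, the explicit form of this mapping is often hard to obtain and the geometry of the high-dimensional feature space is poorly understood, so it is difficult to modify SVM effectively in the feature space to handle imbalanced classification. To address this, the dissertation proposes a new SVM modification, BEF-SVM. The method uses the biased discriminant analysis criterion as the objective function for kernel optimization and increases the class separability of the imbalanced dataset in the empirical feature space, thereby obtaining the best overall classification accuracy.

5. For the application to circuit fault diagnosis, two typical circuits are taken as diagnosis targets and simulated in the PSPICE environment to produce their output waveforms. Data preprocessing techniques such as the Haar wavelet transform are used to extract circuit features, and an SVM-based circuit fault diagnosis system is designed on that basis. To reflect the sample imbalance encountered in practical circuit fault diagnosis, normal and fault samples are generated during simulation with different parameter settings and sampling ratios. The methods proposed in the dissertation are then used to handle the imbalance in the sample set, leading to an SVM classification method suited to practical circuit fault diagnosis.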
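The over-sampling step discussed in item 1 can be illustrated with a minimal sketch of the original SMOTE interpolation rule. This is not the adaptive variant proposed in the dissertation, whose neighbor-selection rule the abstract does not specify. The function name `smote`, its parameters, and the toy data are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Sketch of original SMOTE: create n_new synthetic points, each placed
    on the segment between a minority sample and one of its k nearest
    minority-class neighbours. (Hypothetical helper, not the thesis code.)"""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because neighbours are drawn only from the minority class, this rule ignores nearby majority samples, which is the blindness the abstract attributes to the original method.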

【Abstract】 The Support Vector Machine (SVM) is a machine learning method based on statistical learning theory. Compared with traditional methods such as neural networks, SVM handles practical difficulties such as high dimensionality, nonlinearity, and local minima well, and it has become an active research topic in machine learning. SVM has a strong theoretical foundation and generalizes well even when the number of training samples is small, so it suits fault diagnosis, a typical small-sample learning problem. Research on SVM-based fault diagnosis therefore has both theoretical significance and practical engineering value.

In general, SVM gives good results when the diagnosis dataset is balanced. In practice, however, fault samples are hard to acquire, which makes the diagnosis dataset highly imbalanced. The classification accuracy of SVM on fault samples is then much worse than on normal samples, which limits the practical use of SVM in circuit fault diagnosis. This dissertation addresses the degraded performance of SVM on imbalanced datasets. The research covers two main aspects: data pre-processing methods for imbalanced datasets and modifications of the SVM algorithm for imbalanced datasets. These methods are then applied to analog circuit fault diagnosis, where they address the loss of SVM classification accuracy caused by imbalanced diagnosis data.

The main innovative contributions of this dissertation are as follows.

1. The Synthetic Minority Oversampling TEchnique (SMOTE) is an effective over-sampling technique, but when it generates synthetic samples it considers neither the true distribution of the minority samples nor the distribution of majority samples in their neighborhood, so it is partly blind. A new over-sampling technique, ASMOTE, is therefore proposed. Based on the distribution of the dataset, ASMOTE adjusts the neighbor selection strategy of SMOTE in order to control the quality of the new samples. Simulation results show that preprocessing the dataset with ASMOTE markedly improves the classification accuracy of the SVM classifier.

2. When processing boundary data, traditional sample cutting techniques such as one-sided selection simply remove the boundary samples from the dataset, which loses classification information. To address this, the dissertation proposes the Fuzzy Sampling Cutting Technique, based on the K-nearest neighbor method. To address the loss of classification information in traditional random undersampling, the dissertation proposes the Guided Undersampling Technique, based on unsupervised learning. Experimental results show that preprocessing datasets with these two methods markedly improves the classification accuracy of SVM on imbalanced datasets.

3. SVM can be ineffective at classifying minority samples when it learns from imbalanced datasets. To design a suitable modification, the dissertation first analyzes the underlying cause of this problem and, on that basis, proposes an SVM modification method, μSVM. In the new method, the decision region of the minority class is enlarged by adjusting the distance measurement rule in the classification decision function. Empirical study shows that μSVM effectively improves classification accuracy.

4. SVM's theoretical foundation is a nonlinear mapping from the input space to a high-dimensional feature space in which the dataset becomes linearly separable, but the explicit form of this mapping is very hard, sometimes impossible, to obtain. It is therefore difficult to modify SVM effectively in the feature space so that it suits imbalanced classification tasks. To address this, the dissertation proposes a new SVM modification method, BEF-SVM. BEF-SVM uses the Biased Discriminant Analysis criterion to measure class separability for imbalanced datasets during kernel optimization, so that class separability is increased, which in turn improves prediction accuracy for minority samples.

5. For the application to fault diagnosis, the dissertation selects two typical circuits as diagnosis targets and simulates their output waveforms in the PSPICE environment. A three-stage data preprocessing method, consisting of the Haar wavelet transform, PCA, and data normalization, is applied to extract features from the circuits. These features are then used to develop an SVM-based fault diagnosis system. To reflect the imbalance encountered in practical circuit fault diagnosis, different parameter settings and sampling rates are used in the simulation to generate normal samples and fault samples, and the imbalanced dataset classification methods proposed in the dissertation are applied to this imbalance problem. The result is an SVM classification method suited to practical analog circuit fault diagnosis.
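The Haar wavelet and normalization stages named in item 5 can be sketched as follows. The abstract gives no decomposition depth, coefficient selection, or normalization scheme, so the choices below (orthonormal Haar, keeping only approximation coefficients, min-max scaling) and all function names are illustrative assumptions. The PCA stage is omitted to keep the sketch free of third-party dependencies.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length list.
    Returns (approximation, detail) coefficient lists."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_features(signal, levels):
    """Keep the approximation coefficients after `levels` Haar steps
    (assumed feature choice; the source does not state which are kept)."""
    approx = list(signal)
    for _ in range(levels):
        approx, _detail = haar_step(approx)
    return approx


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy stand-in for a sampled circuit output waveform (hypothetical data).
waveform = [1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0]
features = min_max_normalize(haar_features(waveform, levels=2))
```

Each Haar level halves the length of the waveform, so two levels reduce the eight-sample toy signal to two features before scaling.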
