
面向非均衡数据集的机器学习及在地学数据处理中的应用

Research of Machine Learning on Imbalanced Data Sets and Its Application in Geosciences Data Processing

【Author】 谷琼

【Supervisor】 蔡之华

【Author Information】 China University of Geosciences, Geoscience Information Engineering, 2009, Doctoral dissertation

【摘要 (Abstract)】 Classification is one of the most important tasks in data mining and knowledge discovery. Traditional machine learning research on classification largely rests on the following assumptions: (1) the goal is a high overall classification accuracy; (2) the classes in the data set contain roughly equal numbers of samples; and (3) every classification error incurs the same cost. On these assumptions a large number of classification algorithms have been developed, such as decision trees, Bayesian classification, artificial neural networks, K-nearest neighbors, support vector machines, and genetic algorithms, and they have been widely applied to medical diagnosis, information retrieval, text classification, and many other fields. Real-world classification problems, however, are often class-imbalanced: the number of samples in one class may far exceed that in the others. In such cases a classifier tends to assign all test samples to the majority class and to ignore the minority class, so its performance on the minority class becomes very poor. The characteristics of imbalanced data sets themselves (absolute and relative rarity of minority-class data, data fragmentation, noise) and the limitations of traditional classification algorithms (inappropriate evaluation criteria and inappropriate inductive bias) are the key factors that prevent accurate and reliable classification of imbalanced data. Classification of imbalanced data sets has therefore become a new research focus in machine learning and pattern recognition and poses a major challenge to traditional classification algorithms.

Current work on improving classification performance for imbalanced data proceeds mainly at the data level and at the algorithm level. At the data level, re-sampling methods, including over-sampling and under-sampling, change the class distribution to reduce the degree of imbalance and thereby improve classification performance. At the algorithm level, existing algorithms are modified, as in cost-sensitive learning, support vector machines, one-class learning, and ensemble learning, by adjusting the cost functions of the classes, assigning different weights to samples of different classes, changing the probability density, or shifting the decision boundary so that classification favors the minority class. Although these approaches improve minority-class performance to some degree, they still suffer from over-fitting or from the loss of important majority-class information, which affects the reliability of the classification results. Improving minority-class performance without degrading overall performance, so that the classification results of imbalanced data sets can be used for accurate prediction, therefore remains a topic worth further study.

Addressing the three assumptions of traditional machine learning classification, this thesis carries out a systematic study covering both algorithm improvement and practical validation. It first discusses in detail the assessment methods and evaluation measures for classification of imbalanced data sets. At the data level it then makes two key improvements to existing re-sampling algorithms for imbalanced data and applies the proposed algorithms to the preprocessing of geoscience data for classification; at the algorithm level it fuses data set reconstruction with an algorithm based on minimal misclassification cost. The main work and conclusions are as follows.

I. Performance evaluation and algorithm development for imbalanced data sets

1. Performance evaluation of imbalanced data sets. The thesis discusses the soundness of the first assumption, namely whether a high overall accuracy is a suitable goal for evaluating classification of imbalanced data sets. Correctly evaluating a classification system is important for choosing classification features and classifier parameters, so testing classifier performance is an essential step. There are many assessment methods and evaluation measures, and different classification methods may favor particular measures, so any improvement of a classification algorithm is an improvement with respect to some criterion. Much attention is devoted to designing more advanced algorithms, yet the evaluation of learning results is at least as important and is a key to real progress in data mining. The thesis systematically reviews classical classification techniques and common assessment methods and measures, analyzes and compares numerical and graphical measures, and points out that some measures are problematic when evaluating classification of imbalanced data, making it hard to judge the results correctly. It also examines several composite numerical measures that can likewise be used to evaluate classification of imbalanced data. In fact no single measure suits every classification problem, and blindly fixing on one measure is not a good strategy; which measure to use depends on the application background and the user's requirements, and only a measure chosen to fit the situation supports a correct judgment of an algorithm's classification performance.

2. Re-sampling algorithms for imbalanced data sets. Addressing the second assumption, that the classes contain roughly equal numbers of samples, the thesis proposes two hybrid re-sampling algorithms that combine over-sampling with under-sampling so that an imbalanced data set becomes roughly balanced before classification.

The first is a hybrid re-sampling algorithm with automated adaptive selection of the number of nearest neighbors (Automated Adaptive Selection of the Number of Nearest Neighbors of Hybrid Re-Sampling, ADSNNHRS). Its over-sampling part remedies the blindness of the SMOTE (Synthetic Minority Over-sampling Technique) algorithm in generating synthetic samples and its restriction to numerical attributes: it adapts SMOTE's nearest-neighbor selection strategy to the actual internal distribution of the instances and uses different generation rules for data sets with mixed-type attributes, thereby controlling and improving the quality of the synthetic samples. Its under-sampling part applies an improved neighborhood cleaning method to the augmented instance set, removing redundant majority-class instances and noisy data on the class boundary. The method combines the advantages of over-sampling and under-sampling: it emphasizes the positive class by adaptively generating minority samples and moderately under-samples the majority class to reduce its size, so that the majority and minority classes become relatively balanced; it therefore handles imbalanced classification effectively and improves classifier performance.

The second is a hybrid re-sampling algorithm based on Isomap dimensionality reduction (Hybrid Re-Sampling based on Isomap, HRS-Isomap), which combines nonlinear dimensionality reduction with hybrid re-sampling to reduce the imbalance of the data. The thesis studies the two common families of dimensionality reduction methods, linear methods such as principal component analysis (PCA) and multidimensional scaling (MDS) and nonlinear methods such as isometric feature mapping (Isomap) and locally linear embedding (LLE), and applies two classical methods to geoscience data, where preprocessing before classification simplifies the model structure and improves overall predictive performance. SMOTE implicitly assumes that any point lying between two minority-class samples also belongs to the minority class, which is not necessarily true in practice, especially when the data are not linearly separable. HRS-Isomap therefore first applies Isomap to reduce the dimensionality of the original data nonlinearly, then over-samples with SMOTE on the reduced data, which are more nearly linearly separable, and finally under-samples the over-sampled data with neighborhood cleaning, yielding roughly balanced low-dimensional data. After nonlinear dimensionality reduction the classification performance on imbalanced data sets improves considerably on all evaluation measures; hybrid re-sampling of the reduced data raises the F-measure of the minority class markedly, and while minority-class performance rises significantly the overall performance also improves. Introducing the nonlinear Isomap method into the re-sampling of imbalanced data is thus effective, and Isomap's strong ability to reduce dimensionality and uncover the intrinsic structure of data offers a new way to approach the classification of imbalanced data sets.

3. Cost-sensitive learning algorithms for imbalanced data sets. This part addresses the third assumption, that all classification errors incur the same cost. Most research concentrates either on pure imbalanced-data classification or on pure cost-sensitive learning, ignoring the fact that class imbalance and unequal misclassification costs usually occur together. The thesis therefore fuses two kinds of solutions within a cost-sensitive learning algorithm: data set reconstruction and learning based on minimal misclassification cost. First, sample-space reconstruction brings the two classes of the original data set into rough balance; second, classification is based on minimal misclassification cost rather than minimal error rate, with a larger cost assigned to the class of interest and smaller costs to the other classes, after which a cost-sensitive learning algorithm performs the classification. When the class distribution has been made relatively balanced by sample-space reconstruction and an appropriate cost factor is chosen, the cost-sensitive algorithm based on minimal misclassification cost clearly outperforms other classification algorithms: minority-class performance rises substantially and overall performance also improves.

II. Application and analysis of the methods for imbalanced data in the geosciences

The adaptive nearest-neighbor re-sampling algorithm developed in the thesis is applied to rockburst hazard prediction in engineering. Rockburst statistics form a typical imbalanced data set, for which traditional data mining classification algorithms can hardly produce accurate predictions; yet the minority-class instances of the rockburst phenomenon are precisely the cases of real concern, for which high predictive accuracy is desired. Using the VCR stope rockburst case database established by the Academy of South Africa, simulation experiments are carried out with artificially generated minority-class instances added to the training data, and the predicted rockburst hazard states agree completely with the actual situations. This shows that the proposed re-sampling scheme is feasible for engineering rockburst prediction with imbalanced case data, gives high predictive accuracy, and has good prospects for engineering application. The method requires no complicated mathematical equations or computational models, its input data either exist objectively or are easy to measure, and it is simple to implement; it can identify the controlling factors of rockburst occurrence and thus provide a scientific basis for the rational design and safe construction of deep mining projects.

The main innovations of the thesis are as follows.

1. Two hybrid re-sampling algorithms are proposed. Addressing the problems and the inaccurate assumption in the synthetic-sample generation of the classical over-sampling algorithm SMOTE, the adaptive nearest-neighbor hybrid re-sampling algorithm ADSNNHRS and the Isomap-based hybrid re-sampling algorithm HRS-Isomap are proposed; both handle imbalanced classification effectively.

2. A new cost-sensitive learning algorithm for imbalanced data sets is proposed. In view of the fact that class imbalance and unequal misclassification costs may occur simultaneously, two different kinds of solutions, sample-space reconstruction and cost-sensitive learning based on minimal misclassification cost, are organically combined, and the resulting classification clearly outperforms other algorithms.

3. Methods for handling imbalanced data are introduced into the geosciences. Since large amounts of geoscience data are uncertain, empirical, indirect, incomplete, and class-imbalanced, dimensionality reduction is applied flexibly to the preprocessing of high-dimensional geoscience data, and the concepts, models, and solutions of machine learning on imbalanced data are introduced into geoscience data analysis, providing a powerful set of tools for processing massive geoscience data and for raising the automation and intelligence of geoscience data analysis.
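For a concrete picture of the HRS-Isomap idea summarized above, the following minimal sketch (not the thesis implementation) chains nonlinear dimensionality reduction, over-sampling, and neighborhood cleaning on toy data. It assumes scikit-learn's Isomap and the imbalanced-learn library's standard SMOTE and NeighbourhoodCleaningRule as stand-ins for the modified algorithms developed in the thesis.

```python
# Sketch of the HRS-Isomap pipeline: reduce dimensionality nonlinearly,
# over-sample the minority class, then clean the majority class.
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NeighbourhoodCleaningRule

# Toy imbalanced data set: roughly 5% minority class in 20 dimensions.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# 1) Map the data into a low-dimensional space where it is more separable.
X_low = Isomap(n_neighbors=10, n_components=3).fit_transform(X)

# 2) Over-sample the minority class in the embedded space.
X_os, y_os = SMOTE(k_neighbors=5, random_state=0).fit_resample(X_low, y)

# 3) Remove borderline/noisy majority examples with neighbourhood cleaning.
X_bal, y_bal = NeighbourhoodCleaningRule().fit_resample(X_os, y_os)

print("original class counts:", sum(y == 1), "/", sum(y == 0))
print("re-sampled class counts:", sum(y_bal == 1), "/", sum(y_bal == 0))
```

On such toy data the printed class counts show how the pipeline brings the two classes close to balance in the embedded space before a classifier is trained.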

【Abstract】 Classification is an important task of data mining and knowledge discovery in databases. Conventional machine learning classification studies generally assume that maximizing overall accuracy is the goal of classification, that the classes in the data set contain roughly equal numbers of samples, and that a misclassification in any situation brings the same error cost. Based on such assumptions, a large number of classification algorithms, such as decision trees, Bayesian classification, artificial neural networks, K-nearest neighbors, support vector machines, genetic algorithms, and other newly reported algorithms, have been developed and successfully applied to many fields, such as medical diagnosis, information retrieval, and text classification. However, these assumptions often fail for the imbalanced data sets (IDS) found in real problems, where one class may contain a large number of samples while the other classes have very few. Most classification algorithms seek to minimize the error rate, ignoring the differences in cost between types of misclassification errors, and consequently yield poor predictive accuracy on the minority class. The major difficulties of IDS classification lie in the characteristics of the data sets themselves (absolute and relative rarity of minority-class data, data fragmentation, noise, etc.) and in the limitations of conventional classification algorithms (improper evaluation metrics and inappropriate inductive bias). Consequently, classification of IDS has become a hot topic in machine learning and pattern recognition, and it presents a great challenge for conventional classification algorithms.

In the last decades, many efforts have been made to improve classification performance on the minority class. Two general approaches are currently available for tackling imbalanced classification problems. One works at the data level and is known as data set reconstruction or re-sampling: by under-sampling the majority class, over-sampling the minority class, or combining the two techniques to reduce the degree of class imbalance, the classification performance on the minority class can be improved to a certain extent. The other works at the algorithm level, aiming to modify existing data mining algorithms or develop new ones, such as cost-sensitive learning (CSL), support vector machines, one-class classification, and ensemble learning; by revising the cost factor, assigning different weights to particular samples, changing the probability density function, or adjusting the decision boundary, one can also improve the classification performance on the minority class. However, although improvements have been achieved, problems such as the loss of important majority-class information and over-fitting when dealing with IDS remain unsolved, which reduces the reliability of the predicted results. Therefore, how to improve performance on the minority class while preserving overall classification performance, and thus obtain accurate predictions from the classification results, is still a topic well worth further study.

Centering on this topic and starting from the three basic assumptions, this thesis presents a deep and systematic investigation covering the development of several novel algorithms for IDS and the validation of their reliability.
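As an illustration of the data-level over-sampling mentioned above, the sketch below implements the core interpolation step of the standard SMOTE algorithm, not the adaptive variant proposed in this thesis; the helper name `smote_like_oversample` and the toy data are illustrative assumptions.

```python
# Standard SMOTE-style interpolation: a synthetic minority sample is placed
# at a random point on the segment between a minority sample and one of its
# k nearest minority-class neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from the minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # pick a minority sample
        j = rng.choice(idx[i, 1:])         # pick one of its k neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

X_min = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
                  [1.1, 2.1], [1.3, 2.3], [0.8, 1.8]])
print(smote_like_oversample(X_min, n_new=4, k=3))
```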
As a first step, the assessment methods and evaluation measures of classification performance are thoroughly discussed. At the data level, two key improvements are then proposed for the re-sampling of IDS based on the existing SMOTE over-sampling algorithm, and these techniques are applied to the preprocessing of geoscience data sets to validate their reliability; at the algorithm level, the re-sampling technique is combined with a CSL technique based on the minimal total misclassification cost to achieve better classification performance. The main efforts and conclusions of this thesis are listed below.

1. Classification Performance Evaluation and Algorithm Development for IDS

a) Assessment methods and evaluation measures of the classification performance of IDS

Whether a high overall accuracy can serve as the evaluation measure for IDS classification is discussed first. Assessment methods and evaluation measures of classification performance play a critical role in guiding the design of classifiers. There are many assessment methods and evaluation measures, each with its own advantages and disadvantages, so the modification of a classification algorithm is, to some extent, an improvement with respect to a particular criterion. Many efforts have been devoted to designing and developing more advanced algorithms to solve classification problems; in fact, the assessment methods and evaluation measures are at least as important as the algorithms themselves and are a key stage of successful data mining. We systematically summarize the typical classification technologies, the general classification algorithms, and the assessment methods and evaluation measures for IDS. Several types of performance measures, both numerical measures and graphical measures for visualizing classifier performance, are analyzed and compared; the problems of these measures when applied to IDS may lead to misinterpretation of classification results and even to wrong decisions. Besides, a series of composite numerical evaluation measures that can also serve for evaluating the classification performance of IDS is investigated. In general, there is no universal evaluation measure for all kinds of classification problems; a proper evaluation measure largely depends on the specific application requirements, and choosing an appropriate measure for the background at hand helps one judge the classification performance of an algorithm correctly.

b) Re-sampling algorithms for IDS

We propose two new hybrid re-sampling techniques based on an improved SMOTE over-sampling algorithm. By combining over-sampling with under-sampling, the IDS is brought close to balance before classification.

The first technique is the hybrid re-sampling method with automated adaptive selection of the number of nearest neighbors. The SMOTE method blindly adds new synthetic minority-class examples by randomly interpolating pairs of close neighbors, and it cannot handle data sets with nominal features. In our over-sampling procedure these two problems are solved by automated adaptive selection of the nearest neighbors and adjustment of the neighbor-selection strategy, so the quality of the new samples can be well controlled. In the under-sampling procedure, the improved under-sampling technique of the neighborhood cleaning rule removes borderline majority-class examples and noisy or redundant data.
This method in fact combines the improved SMOTE and the NCR data cleaning methods. The main motivation is not only to balance the training data but also to remove noisy examples lying on the wrong side of the decision border. The removal of noisy examples may help to find better-defined class clusters, allowing the creation of simpler models with better generalization capability, and therefore promises effective processing of IDS and considerably enhanced classifier performance.

The second technique is the Isomap-based hybrid re-sampling method, which attempts to reduce the degree of class imbalance by combining the Isomap nonlinear dimensionality reduction method with the hybrid re-sampling technology. We first analyze the most common linear (principal component analysis and multidimensional scaling) and nonlinear (isometric feature mapping and locally linear embedding) dimensionality reduction algorithms; these technologies are then used to preprocess geoscience data and to reduce the dimensionality of the feature space, which simplifies the structure of the classification model and greatly improves overall classification performance. SMOTE over-samples the minority class, but it rests on the strict assumption that the local space between any two minority-class instances belongs to the minority class, which is not always true when the training data are not linearly separable. We therefore present a new re-sampling technique based on Isomap: the Isomap algorithm is first applied to map the high-dimensional data into a low-dimensional space, where the input data are more separable and can thus be over-sampled by SMOTE; the over-sampled data are then under-sampled with the NCR method, yielding balanced low-dimensional data sets. With this procedure the evaluation measures improve step by step and the classification performance rises considerably, especially the F-measure of the minority class; in fact, overall and minority-class performance improve simultaneously. The underlying re-sampling algorithm is implemented by incorporating the Isomap technique into the hybrid SMOTE and NCR algorithm, and experimental results demonstrate that the Isomap-based hybrid re-sampling algorithm outperforms re-sampling alone. Isomap is thus an effective means of dimensionality reduction for re-sampling and provides a new possible solution for IDS classification.

c) CSL algorithm for IDS

We discuss the misclassification cost problem centering on the third assumption of conventional machine learning. Most studies focus on IDS classification or on cost-sensitive learning systems alone, neglecting the fact that imbalanced class distributions and unequal misclassification costs usually occur simultaneously. We attempt to combine the re-sampling and CSL techniques in order to address the misclassification of IDS. On the one hand, the re-sampling technique produces balanced data sets by reconstructing both the majority and the minority class; on the other hand, classification is performed on the basis of minimal misclassification cost rather than maximal accuracy, with the misclassification cost for the minority class set much higher than that for the majority class.
A cost-sensitive learning procedure is then conducted for classification. With an appropriate cost factor and data sets balanced by the re-sampling technology, our CSL algorithm based on the minimal misclassification cost performs much better than the currently available classification techniques: not only is the classification performance on the minority class improved significantly, but the overall classification performance is also enhanced to a certain extent.

2. Application and Analysis of the IDS Classification Algorithms in Geosciences

The re-sampling method with automated adaptive selection of the number of nearest neighbors is applied to rockburst hazard prediction in engineering. The statistics of large numbers of rockbursts form a typical IDS, for which it is very difficult to obtain accurate predictions with conventional classification methods; in fact, it is mostly the minority class rather than the majority class that is of concern, and high prediction accuracy is always desired for it. In this thesis the VCR rockburst database provided by the Academy of South Africa is employed as a sample IDS for classification and prediction. Extra artificial minority-class samples are added to expand the training set, and the experimental simulation yields predictions exactly consistent with the actual situations. The re-sampling method and classification scheme we developed are therefore feasible and reasonable for engineering applications of IDS. Our algorithms require no complicated mathematical equations or computational models, and the input data sets can be easily measured or obtained, so the method can be readily implemented to determine the controlling factors of rockburst; such predictions can provide reasonable and sufficient guidance for designing safe construction schemes in deep mining engineering.

The major innovations and contributions of this thesis are as follows:

a) We developed two types of hybrid re-sampling algorithms. Addressing the problems and the improper assumption of the SMOTE algorithm, we proposed the hybrid re-sampling algorithm with automated adaptive selection of the number of nearest neighbors and the Isomap-based hybrid re-sampling algorithm; both algorithms can deal with IDS classification effectively.

b) We proposed a novel CSL algorithm for IDS. Addressing the fact that imbalanced class distributions and unequal misclassification costs usually occur simultaneously, we combined the re-sampling and CSL techniques to solve the misclassification problem of IDS; the combined algorithm integrates the advantages of both and thus performs much better than existing methods.

c) We introduced IDS processing methods into the analysis of geoscience data. Because geoscience data are typically uncertain, empirical, indirect, incomplete, and class-imbalanced, we first employ the dimensionality reduction method to preprocess the data and then apply the effective IDS classification methods to process huge amounts of geoscience data. Such an analytical scheme is a powerful tool for the automatic and intelligent analysis of geoscience data.
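As a worked illustration of the minimal-misclassification-cost idea behind the CSL contribution, the following sketch (an assumption for illustration, not code from the thesis) replaces the usual argmax decision rule with an argmin over expected cost computed from a user-supplied cost matrix; the cost values shown are arbitrary examples.

```python
# Cost-sensitive classification by minimal expected misclassification cost:
# instead of predicting the most probable class, predict the class whose
# expected cost is lowest under a user-supplied cost matrix.
import numpy as np

# cost[i][j] = cost of predicting class j when the true class is i.
# Misclassifying the minority class (class 1) is made 10x more costly.
cost = np.array([[0.0, 1.0],    # true majority: cheap to misclassify
                 [10.0, 0.0]])  # true minority: expensive to misclassify

def min_cost_predict(proba, cost):
    """proba: (n_samples, n_classes) posterior estimates from any classifier."""
    expected_cost = proba @ cost          # (n_samples, n_classes)
    return expected_cost.argmin(axis=1)

proba = np.array([[0.85, 0.15],           # plain argmax would say class 0 ...
                  [0.60, 0.40]])
print(min_cost_predict(proba, cost))      # ... but expected cost favours class 1
```

With the 10:1 cost ratio, a sample whose minority-class posterior is only 0.15 is already assigned to the minority class, illustrating the deliberate bias toward the class of interest described above.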
