Research on Outlier Detection Methods and Their Key Techniques (异常检测方法及其关键技术研究)

【Author】 Chen Bin (陈斌)

【Advisor】 Chen Songcan (陈松灿)

【Author information】 Nanjing University of Aeronautics and Astronautics, Computer Application Technology, 2013, PhD

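【Abstract (translated from the Chinese)】 Outlier detection is the task of detecting and discovering data patterns in observed data that do not conform to normal (expected) behavior. Depending on the application domain, these patterns are also called outliers, inconsistent points, novelties, anomalies, or contaminants. In recent years outlier detection has been widely used in fault diagnosis, disease detection, intrusion detection, credit card (or insurance) fraud detection, and identity verification. In these fields, abnormal patterns often carry significant behavioral information, usually highly damaging or even fatal: abnormal network traffic on the Internet may indicate the leakage of sensitive information from an attacked host, and credit card fraud causes huge economic losses. Research on outlier detection is therefore of great theoretical and practical value, has received wide attention, and has become a very active research direction in pattern recognition.

What makes the task special is that usually only data patterns conforming to the expected (normal-class) behavior are available, while patterns violating it (the outlier class) are rare or unknown. This extreme imbalance between the two classes, with far fewer outlier samples than normal samples, makes outlier detection very difficult. Current research therefore concentrates on unsupervised learning frameworks and on supervised methods that use a very small number of labeled outlier samples. This thesis studies the principles and robustness of various outlier detection methods and the embedding of prior information into them. The main work is as follows:

1. One-cluster Clustering based Data Description (OCCDD) is proposed. It uses the one-cluster form of Possibilistic C-Means (PCM), that is, P1M (PCM with C=1), to compute sample weights, and obtains the enclosing hypersphere by weighted averaging. This overcomes the non-robustness to outliers of the hypersphere center in Support Vector Data Description (SVDD), which estimates a hypersphere enclosing most normal-class samples by minimax optimization, and it avoids the high training complexity of solving the SVDD quadratic program. It is also proved theoretically that P1M has a global optimality property that PCM with C>1 generally lacks. Further, to exploit the multi-view nature that observed data naturally have in applications such as text classification, OCCDD is extended to a multi-view outlier detection method. Unlike separate training on each view, it learns from all views simultaneously so that the views reinforce one another.

2. An SVDD regularized by AUC (Area under the ROC curve) is proposed for the case where outlier samples are distributed around the normal-class samples. Exploiting the insensitivity of AUC to the sample distribution and to misclassification costs, it embeds AUC as a regularization term in the SVDD objective, so that the volume of the minimum enclosing ball and the AUC performance are optimized together. This addresses the inability of general outlier detectors to cope with extremely imbalanced distributions that contain very few outlier samples. Two acceleration schemes are then proposed to reduce the high training complexity introduced by AUC regularization.

3. A design framework for manifold learning algorithms is proposed: mXXX≈ISOMAP+XXX, where XXX can be any learning algorithm based on Euclidean distance. The framework only needs to treat the geodesic distance in the original space as an approximation of the Euclidean distance in the ISOMAP-reduced space, with no explicit ISOMAP dimensionality reduction. In other words, the original XXX algorithm is executed in the implicit ISOMAP-reduced space, which embeds the manifold structure information. Because Euclidean distance cannot faithfully capture the geometric structure of data that lie on or near a low-dimensional nonlinear manifold, the framework is applied to SVDD to design a manifold-embedded SVDD (mSVDD). Its advantages are: (1) by approximating Euclidean distances in the ISOMAP-reduced space, it solves the problem that a geodesic-distance-based SVDD cannot be optimized directly; (2) it does not need to actually execute the MDS (Multidimensional Scaling) step of ISOMAP or to select the dimension of the embedded manifold; (3) unlike SVDD in the original space, which is based on Euclidean distance, mSVDD is based on geodesic distance and implicitly performs ISOMAP, so it can realize a manifold embedding.

4. The relationship between domain-based (support-region-based) outlier detectors and density estimation is revealed. Building on a survey of current outlier detection methods, the thesis focuses on two domain-based one-class classifiers, the One-class Support Vector Machine (One-class SVM) and SVDD, and reveals the essential relationship between their Gaussian-kernel forms and density estimation. First, domain-based one-class classifiers are unified under the framework of density estimation. Second, it is proved that the density estimate induced by these classifiers is consistent with the true density, and that optimizing these classifiers also reduces the integrated squared error.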
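The weighting idea in item 1 can be illustrated with a minimal Python sketch. It uses the standard possibilistic C-means updates restricted to a single cluster and then places a ball around the weighted mean. It is not the thesis's OCCDD implementation: the function names, the fixed scale parameter `eta`, the fuzzifier `m=2`, and the quantile rule for the radius are all assumptions made for illustration.

```python
import numpy as np

def p1m_weights(X, m=2.0, n_iter=100, tol=1e-8):
    """One-cluster possibilistic C-means (C=1): return (center, typicalities)."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)                   # start from the plain mean
    d2 = ((X - center) ** 2).sum(axis=1)      # squared distances to the center
    eta = d2.mean()                           # assumption: fixed scale parameter
    if eta == 0.0:                            # all points coincide
        return center, np.ones(len(X))
    for _ in range(n_iter):
        # Standard PCM typicality: close to 1 near the center, close to 0 far away.
        u = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))
        w = u ** m
        new_center = (w[:, None] * X).sum(axis=0) / w.sum()  # weighted average
        shift = np.linalg.norm(new_center - center)
        center = new_center
        d2 = ((X - center) ** 2).sum(axis=1)
        if shift < tol:
            break
    u = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))
    return center, u

def fit_ball(X, quantile=0.95):
    """Enclosing ball around the P1M center (radius rule is an assumption)."""
    center, _ = p1m_weights(X)
    dist = np.linalg.norm(np.asarray(X, dtype=float) - center, axis=1)
    return center, np.quantile(dist, quantile)

def is_outlier(x, center, radius):
    """Flag a point that falls outside the ball."""
    return np.linalg.norm(np.asarray(x, dtype=float) - center) > radius
```

Because distant points receive typicalities near zero, they barely move the weighted center, whereas a plain mean is pulled toward them. No quadratic program is solved at any step.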

【Abstract】 Outlier detection aims to detect and discover abnormal data patterns in observed data that do not conform to normal (expected) behavior. Depending on the application, these abnormal patterns are called outliers, inconsistent points, novelties, or contaminants. In recent years, outlier detection has been widely applied in fault diagnosis, disease detection, intrusion detection, credit card (or insurance) fraud detection, and person identification. In these areas, an abnormal pattern often implies significant (usually very harmful, even deadly) behavior. For instance, abnormal traffic on the Internet may imply the leakage of sensitive information from an attacked host, and credit card fraud can lead to great economic loss. Because of its great practical meaning and value, outlier detection has become a very active research area that attracts close attention from many researchers.

Unlike other learning tasks, outlier detection typically has only data patterns conforming to expected behavior (the target class) and rare, or even no, data patterns violating it (the outlier class). The resulting extreme imbalance (outlier samples are far fewer than target samples) makes outlier detection very difficult. Recent research has therefore mainly focused on unsupervised learning frameworks and on supervised learning methods that use very few labeled outlier samples. Based on an in-depth study of the principles of various outlier detection methods, their robustness to outliers, and the embedding of prior knowledge, the contributions of this thesis are as follows:

1. First, One-cluster Clustering based Data Description (OCCDD) is proposed. It employs the PCM (Possibilistic C-Means) algorithm with one cluster, that is, P1M (PCM, C=1), to compute the weights, and then obtains an enclosing ball by weighted averaging. As a result, OCCDD avoids the sensitivity to outliers and the high training complexity that Support Vector Data Description (SVDD) incurs from minimax optimization. Second, it is proved theoretically that P1M has a global optimality property that the original PCM (C>1) lacks. Finally, a multi-view OCCDD is proposed to exploit the intrinsic multi-view property of data in text classification. Unlike general classifiers that learn from a single view, multi-view OCCDD learns from all views simultaneously, and its performance improves because the views reinforce one another.

2. An SVDD regularized with the Area under the ROC curve (AUC) is proposed for the situation in which outliers lie around the target samples. The regularized SVDD incorporates the AUC measure into the optimization objective of SVDD and simultaneously optimizes the volume of the minimum enclosing ball and the AUC performance, so as to deal with the extreme imbalance in the class distribution. Two speed-up techniques are then proposed to reduce the high training complexity introduced by AUC regularization.

3. A design framework for manifold-based classifiers, mXXX≈ISOMAP+XXX (where XXX denotes an existing learning algorithm based on Euclidean distance), is proposed. It replaces the Euclidean distance in the feature space obtained after ISOMAP dimension reduction with the geodesic distance in the input space, and thus implicitly performs ISOMAP without actually running the ISOMAP procedure. When the observed data have an underlying manifold, SVDD performance degrades because Euclidean distance cannot depict the true geometrical structure, so we apply this framework to SVDD and derive an SVDD with Manifold Embedding (mSVDD). mSVDD has the following advantages: (1) by approximating the Euclidean distances in the feature space induced by ISOMAP, it solves the problem that a geodesic-distance-based SVDD cannot be optimized directly; (2) it avoids the actual Multidimensional Scaling (MDS) step of ISOMAP and the selection of the dimension of the Euclidean space after ISOMAP; (3) unlike the original Euclidean-distance-based SVDD, mSVDD is based on geodesic distance and implicitly executes ISOMAP, so it can find a manifold embedding.

4. The relationship between density estimation and domain-based outlier detectors is revealed, in particular the essential relation between kernel density estimation and two domain-based outlier detectors with a Gaussian kernel, the One-Class Support Vector Machine (OCSVM) and SVDD. That is, domain-based outlier detectors fall into the framework of density estimation. Moreover, the density estimator induced by OCSVM and SVDD is consistent with the true density, and optimizing OCSVM and SVDD also reduces the Integrated Squared Error (ISE).
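Item 3 relies on geodesic distances in the input space. The sketch below shows the usual ISOMAP-style approximation of those distances: build a k-nearest-neighbor graph weighted by Euclidean distance, then take shortest paths through it. It is not the thesis's mSVDD. The function name `geodesic_distances` and the neighborhood size `k` are assumptions, and the sketch further assumes distinct points and a connected neighbor graph.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=5):
    """Approximate pairwise geodesic distances via a kNN graph (requires k < n)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Dense pairwise Euclidean distances; adequate for small n.
    eucl = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Keep only edges to the k nearest neighbors; a zero entry means "no edge".
    graph = np.zeros((n, n))
    nearest = np.argsort(eucl, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    rows = np.repeat(np.arange(n), k)
    cols = nearest.ravel()
    graph[rows, cols] = eucl[rows, cols]
    # Dijkstra on the undirected graph; unreachable pairs come back as inf.
    return shortest_path(graph, method="D", directed=False)
```

On points sampled along a half circle of radius 1, the two endpoints are at Euclidean distance 2, while the graph distance follows the arc and approaches π. This gap is what a purely Euclidean method such as plain SVDD cannot see.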
