
Research on Some Problems and Applications in Support Vector Data Description

【Author】 谷方明

【Supervisor】 刘大有

【Author Information】 Jilin University, Computer Application Technology, 2010, PhD

【摘要 (Abstract)】 Statistical learning theory studies the characteristics of machine learning with finite samples and provides a complete, unified theoretical framework for finite-sample learning problems. The Support Vector Machine (SVM) is a learning method developed on this basis; built on the structural risk minimization principle and readily combined with many other machine learning techniques, it has shown performance superior to many other methods in practice. Support Vector Data Description (SVDD) is a data description method derived from statistical learning theory and SVM. Unlike SVM, which seeks an optimal separating hyperplane, SVDD seeks the smallest hypersphere enclosing all target samples. This supervised one-class classifier is widely used in fault detection, industrial and medical diagnosis, network security, target-class recognition, intrusion detection, face recognition, and other fields, and its study has been very active in machine learning in recent years. However, because SVDD is a relatively new theory, it remains immature in many respects and calls for further research; its learning methods in particular are among the key and difficult problems, and the commonly used SVDD methods are supervised. Aiming to improve learning capability, this thesis studies several problems of SVDD from both the unsupervised and the semi-supervised perspectives, covering new learning algorithms, learning accuracy, data preprocessing, and extended applications. The main results are as follows:

(1) To address the inability of conventional SVDD to accurately describe the distribution of target data in the unsupervised setting, AIKCSVDD (Artificial Immune Kernel Cluster-based SVDD) is proposed. AIKCSVDD takes the memory antibodies produced by artificial immune kernel clustering as the target data points and learns them with SVDD. On the one hand, it combines the strength of kernel clustering in handling classes with unclear boundaries with the global convergence of immune network clustering and its freedom from prior knowledge; on the other hand, because memory antibodies replace the raw data in learning, the global distribution of the original data can be captured without specifying the number of classes in advance.

(2) To address the same problem in the semi-supervised setting, a semi-supervised weighted support vector data description method is proposed. In practice, large amounts of data with known labels are hard to obtain; to describe an unknown data set accurately with little labeled information, label propagation and weighting are introduced into SVDD. The method first uses a semi-supervised label propagation algorithm to learn, from the known labels, the information hidden in the large amount of unlabeled data, and then learns the latent classification of the data set with a weighted SVDD. Experimental results show that with little labeled information the method clearly outperforms conventional SVDD.

(3) Building on the semi-supervised work above, semi-supervised learning is studied further, starting from the classical kNN (k-Nearest Neighbor) classifier, and a kNN classification method based on semi-supervised weighted distance metric learning is given. To find a suitable distance metric from the limited labeled data, Relevant Component Analysis (RCA) is used to learn a Mahalanobis metric. Conventional RCA, however, depends heavily on the amount of label information and may produce a biased metric when labels are scarce or erroneous, so a semi-supervised learning method is used to overcome this limitation. The method learns a Mahalanobis distance from very few known labels through label propagation and weighting, and then applies it in kNN classification. Experimental results show that with very few labels the method outperforms kNN classification with the Euclidean distance.

(4) For application domains such as fault diagnosis, where data are high-dimensional and unevenly distributed, this thesis studies and presents an SVDD method based on LLE with a kernel distance metric. To uncover the meaningful low-dimensional structure hidden in high-dimensional observations and extract features that are easier to recognize, the method applies LLE for dimensionality reduction during data preprocessing. Because LLE requires dense sampling and the Euclidean distance often performs poorly in high-dimensional sparse spaces, the Euclidean metric of the original LLE is replaced with a kernel-space distance; the improved LLE then reduces the dimensionality of the data set so that the new low-dimensional data better preserve the original manifold. Finally, SVDD is applied to the reduced data. Fault detection experiments based on SVDD show that the method is particularly suitable for high-dimensional, unevenly distributed data sets.

In summary, this thesis studies several problems and applications of support vector data description; the new methods proposed have theoretical significance and application value for improving SVDD's learning capability. Future work will further refine and deepen the present results and apply them in engineering practice.
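The hypersphere objective behind SVDD can be made concrete with a small sketch. The fragment below is not from the thesis: it handles only the hard-margin, linear-kernel special case of SVDD, where finding the smallest enclosing hypersphere reduces to the classical minimum enclosing ball problem, approximated here with the simple Badoiu–Clarkson iteration. Full SVDD adds slack variables and a kernel, but the geometric goal is the same; `minimum_enclosing_ball` is an illustrative name, not an identifier from the thesis.

```python
import numpy as np

def minimum_enclosing_ball(X, iters=1000):
    """Badoiu-Clarkson approximation of the smallest ball enclosing X.

    This is the geometric core of hard-margin, linear-kernel SVDD:
    find a center c and radius R such that every target point lies
    inside the ball of radius R around c.
    """
    c = X[0].astype(float)          # start the center at an arbitrary point
    for t in range(1, iters + 1):
        # move toward the farthest point with a shrinking step size
        far = X[np.argmax(np.linalg.norm(X - c, axis=1))]
        c += (far - c) / (t + 1)
    R = np.linalg.norm(X - c, axis=1).max()
    return c, R

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))       # synthetic target class
c, R = minimum_enclosing_ball(X)
inside = np.linalg.norm(X - c, axis=1) <= R + 1e-9
print(inside.all())                 # → True: all target points are enclosed
```

In soft-margin SVDD a penalty parameter C lets some points fall outside the ball, and replacing inner products with a kernel yields flexible, non-spherical boundaries in input space.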

【Abstract】 Statistical learning theory investigates the characteristics of learning problems with finite samples and provides a complete, consistent theoretical framework for them. Built on statistical learning theory, the Support Vector Machine (SVM) is a learning method based on the structural risk minimization principle; it combines readily with many other machine learning techniques and has shown superior performance in many settings. Support Vector Data Description (SVDD) is a new method rooted in statistical learning theory and SVM. Unlike SVM, which seeks a separating hyperplane, SVDD seeks a hypersphere enclosing the target data. As a one-class classifier, or data description method, SVDD is widely applied in fault detection, industrial and medical diagnosis, network security, target-class identification, intrusion detection, face recognition, and so on, and has become an active topic in machine learning in recent years. However, SVDD is a quite new theory and is still immature in many respects, its learning algorithms being among the key and difficult points. In this thesis we aim to improve SVDD's learning capability in the unsupervised and semi-supervised settings, exploring several problems of SVDD around improving learning accuracy, studying new learning algorithms, data preprocessing, and extended applications. The details follow:

(1) To solve the inaccurate description produced by conventional SVDD in unsupervised settings, we propose AIKCSVDD, a support vector data description method based on artificial immune kernel clustering. It uses the memory antibodies generated by an artificial immune kernel clustering algorithm as target data, and then applies SVDD for multi-class classification. On the one hand, immune kernel clustering needs no prior knowledge and better recognizes data with unclear boundaries; on the other hand, using memory antibodies as target data reflects the global distribution of the original data without knowing the number of clusters in advance.

(2) To raise the classification precision of conventional SVDD when little label information is available, we propose a semi-supervised weighted support vector data description method. It uses a graph-based semi-supervised learning technique to infer the latent labels of a large amount of unlabeled data from a small amount of labeled data, and then trains a weighted SVDD classifier on the whole data set. Experiments on UCI data sets show that our method is effective when very little label information is known.

(3) k-Nearest Neighbor (kNN) classification is one of the most popular machine learning techniques, but it often fails when little label information is available, the distance metric is chosen poorly, or many irrelevant features are present. To handle these issues, we introduce a semi-supervised distance metric learning method for kNN classification. It uses a semi-supervised label propagation algorithm to obtain more labels from very little initial information, applies an improved weighted RCA to learn a Mahalanobis distance function, and finally replaces the Euclidean distance of the kNN classifier with the learned Mahalanobis metric. Experiments on UCI data sets show the effectiveness of our method.

(4) In real applications such as fault diagnosis, data often have very high dimensionality and non-uniform distributions. We propose a new method combining kernel-distance LLE with SVDD to address these problems. To mine the meaningful low-dimensional information hidden in high-dimensional data and extract better classification features, we use LLE for dimensionality reduction in preprocessing. Because LLE needs dense sampling and performs poorly with the Euclidean distance in high-dimensional sparse spaces, we use a kernel-space distance metric to improve LLE and better preserve the original data manifold in low dimensions. SVDD is then applied to the reduced data set. Experimental results show that the proposed method performs better on high-dimensional, non-uniformly distributed data.

Overall, this thesis studies several problems and applications of support vector data description. The results have theoretical and practical significance for improving SVDD's learning capability. In future work, besides refining the current results, we hope to study SVDD more deeply and apply the results in real applications.
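The metric-learning idea behind contribution (3) can be illustrated with a minimal sketch. The code below is an assumption-laden simplification, not the thesis's semi-supervised weighted algorithm: it learns a plain RCA-style Mahalanobis metric from two fully labeled chunklets (the thesis instead obtains labels via propagation and weights them) and plugs the metric into a toy kNN classifier; `rca_metric` and `knn_predict` are hypothetical helper names.

```python
import numpy as np

def rca_metric(chunklets):
    """RCA-style metric: inverse of the pooled within-chunklet covariance.

    Each chunklet is an array of points known to share a label. Whitening
    the within-class scatter makes the Mahalanobis distance down-weight
    directions of label-irrelevant variability.
    """
    centered = np.vstack([c - c.mean(axis=0) for c in chunklets])
    cov = centered.T @ centered / len(centered)
    return np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))  # regularized

def knn_predict(x, X, y, M, k=3):
    """kNN vote using d(x, z)^2 = (x - z)^T M (x - z)."""
    D = X - x                                   # broadcast differences
    d2 = np.einsum('ij,jk,ik->i', D, M, D)      # squared Mahalanobis dists
    nearest = y[np.argsort(d2)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

rng = np.random.default_rng(1)
# two classes separated along axis 0, very noisy along axis 1
A = rng.normal([0, 0], [0.5, 3.0], size=(30, 2))
B = rng.normal([3, 0], [0.5, 3.0], size=(30, 2))
X, y = np.vstack([A, B]), np.array([0] * 30 + [1] * 30)

M = rca_metric([A, B])
print(knn_predict(np.array([2.8, 2.0]), X, y, M))  # class 1: the learned
# metric discounts the noisy second axis, unlike plain Euclidean distance
```

In the thesis's setting the chunklets would come from label propagation over mostly unlabeled data, with per-point weights reflecting propagation confidence.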

  • 【Online Publisher】 Jilin University (吉林大学)
  • 【Online Publication Issue】 2011, No. 05