节点文献

基于图和熵正则化的半监督分类算法

Semi-supervised Classification Algorithm Based on Graph and Entropy Regularization

【作者】 刘小兰

【导师】 郝志峰;

【作者基本信息】 华南理工大学 , 计算机应用技术, 2011, 博士

【摘要】 半监督学习(Semi-supervised Leaning,SSL)试图利用大量的无标记样本学习数据的内在几何结构,在此基础上利用少量的有标记样本完成降维、分类和回归等任务。由于SSL在减少人工标注代价、提高机器学习性能方面的突出优势,以及在网页检索、文本分类、基于生物特征的身份识别和医疗诊断等领域应用的广泛性,从上世纪90年代开始,它就在机器学习界引起了关注。目前,SSL已成为机器学习研究中最受关注的问题之一。本文在分析了SSL的发展现状和目前仍存在的问题的基础上,对基于图和熵正则化的半监督分类学习中的若干重要问题进行了研究,具体研究内容和成果如下:1、数据图的构造。数据图的构造是设计基于图的SSL算法的第一步。大多数传统数据图构造方法是参数依赖的,且对参数较敏感;另一方面,最近提出的基于稀疏表达的最小化L1模构造模型不能保证非负解,因此不能直接用作图上边的权重。针对这些不足,提出了两个基于非负稀疏表达的最小化L1模构造模型:L1IMP和L1IMPv。两个新模型在现有最小化L1模构造模型的基础上增加了非负约束,从而使得模型的稀疏解不仅可以反映成对样本间的紧密程度,而且可以直接用作图上边的权重。此外,新的图构造方法可以在确定图的邻接结构的同时完成边的权重计算。结合标记传播算法,在UCI和人脸数据集上的实验结果表明,L1IMP和L1IMPv在大多数情况下的分类效果优于传统方法。2、基于不相似性的图SSL算法。负相似性在协同过滤等问题中经常出现。针对目前提出的大部分图SSL算法都不能处理不相似性或负相似性的不足,提出了一个基于负相似性的图SSL模型SMLP。SMLP的优化目标是如下两个量的比值:类标记和正相似性的不一致性以及类标记和负相似性的一致性;同时,SMLP允许有标记样本的标记予以重新标记,运用一种全局优化方法求解SMLP,可以在O ( n3 logε-1 )时间内获得一个ε-最优解。在UCI数据集和协同过滤问题上验证了SMLP算法的有效性。3、适于处理标记有噪声数据的图SSL算法。算法的基本思路是运用软标记方法来处理标记有噪声数据。首先,利用各种标记软化方法将样本的类标记转化为软标记,相比硬标记,软标记可以更好地容纳监督者对模式类别的不确定性。在此基础上,嵌入现有的基于图的SSL算法LGC,以达到预期目的。在有类重叠的UCI和物体识别数据集上的实验表明,与基于硬标记的LGC算法相比,基于软标记的LGC算法可以更好地用于标记有噪声数据的半监督分类学习。4、基于熵正则化的SSL算法。提出了一个基于条件Havrda-Charvat’s Structuralα-熵正则化的直推式半监督分类模型MinEnt。MinEnt的基本思想是:一个好的聚类标准是对无标记样本的一个好的刻画。在MinEnt模型中,用条件Havrda-Charvat’s Structuralα-熵聚类标准刻画无标记样本及其所属类别之间的关系,同时对有标记样本采用其对数似然函数。设计了基于拟牛顿法的求解算法。所提出的算法是判别式的,降低了对模型的依赖程度;同时,它可以预测样本空间中任何一个样本的标记,是一种直推式方法。在UCI数据集上的仿真实验验证了该算法的有效性。

【Abstract】 Semi-supervised learning (SSL) tries to discover the intrinsic structure of the given data by use of lot of unlabeled data, on the basis of which, it finishes the task of dimensionality reduction, classification and regression by making use of few labeled data. Because of its prominent advantage of reducing the cost of labeling manually and improving the performance of machine learning, and its widespread popularity in web page retrieval, text classification, personal identification based on biometrics feature and medical diagnosis, SSL has received the attention of machine learning community since 1990. Now, SSL becomes one of the most active research areas in the machine learning field. Based on analyzing the state of the art and the existing problems of SSL, the thesis mainly investigates some key issues of graph-based and entropy regularization SSL. The contributions are as follows:1. Graph construction. Graph construction is the first step of graph-based SSL algorithm. Most traditional graph construction methods depend on parameters and are sensitive to these parameters. The solutions of the recently proposed L1 norm reconstruction error minimization graph construction models based on sparse representation may be negative, so they can not be used as the graph weights directly. According to these deficiencies, two L1 norm minimization graph construction models based on nonnegative sparse representation named L1_IMP and L1_IMPv which add nonnegative constraints to the existing L1 norm minimization models are proposed. The solutions of the proposed models can not only reflect the close relation between the sample pairs, but also can be used as the graph weights directly. Moreover, L1_IMP and L1_IMPv complete the neighborhood graph construction and graph weights calculation within one step. Experimental results on UCI and face recognition datasets show that the classification accuracy of the label propagation algorithms using L1_IMP and L1_IMPv are better than that of the label propagation algorithms using traditional graph construction methods in most cases.2. Graph-based SSL algorithm by dissimilarity. Dissimilarity, or negative similarity frequently appears in many practical applications such as collaborative filtering problem. Considering that most graph-based SSL algorithms can not deal with negative similarity, a graph-based SSL model based on negative similarity named SMLP is proposed. The optimization objective of SMLP is the ratio between the following inconsistency and consistency: the inconsistency between the class assignment and the positive similarity, and the consistency between the class assignment and the negative similarity. Also SMLP allows the labeled data to be relabeled. A global optimal algorithm is applied for solving SMLP, yielding anε?global optimal solution in a computational effort of O ( n 3 logε?1 ). Experimental results on UCI datasets and collaborative filtering problem verify the effectiveness of SMLP algorithm.3. Graph-based SSL algorithm for misclassified data. We use soft labels to deal with misclassified data in the circumstance of SSL. First hard labels of labeled samples are converted to soft labels by several existing methods which can accommodate the uncertainty of an external teacher about uncertain patterns better than hard labels. Then soft labels are embedded into the existing graph-based SSL algorithm LGC to deal with the misclassified data. Experimental results on UCI and object recognition datasets with some classes overlapping show that LGC with soft labels is more resistant to label errors compared with LGC with hard labels.4. SSL algorithm based on minimum entropy regularization. A discriminative SSL classification model named MinEnt is established based on the conditional Havrda-Charvat’s Structuralα-entropy regularization. The basic idea of MinEnt is that a good clustering criterion is also a good description of the unlabeled data. In MinEnt, conditional Havrda-Charvat’s Structuralα-entropy clustering criterion is used to describe the relation of unlabeled data and theirs labels, log likelihood function is used to describe the labeled data and Quasi-Newton method is applied for solving MinEnt. The proposed algorithm is discriminative which has less dependence on the model selection. Moreover, the proposed algorithm is inductive, so it can predict the labels of the out of the samples easily. Experimental results on several UCI datasets demonstrate the effectiveness of the proposed algorithm.

节点文献中: 

本文链接的文献网络图示:

本文的引文网络