节点文献

监督信息在图学习中的有效利用

The Utilization of the Supervised Information in Graph-based Learning

【作者】 何萍

【导师】 陈崚;

【作者基本信息】 南京航空航天大学 , 计算机应用技术, 2012, 博士

【摘要】 基于图的学习方法近年来受到越来越多研究者的关注,它不仅有着深厚的图论理论作基础,而且还指明了数据间的联系与数据本身同样重要这一提高现有学习算法性能的关键。监督信息对于机器学习中的有监督学习和半监督学习,既是重要的信息来源,又是最终的学习目的。本文将对应到图上的监督信息划分为两类——点约束和边约束,并把机器学习中的有监督分类和半监督聚类分别视为具有点约束的学习问题和具有边约束的学习问题,重点从图的角度讨论这两类学习算法对监督信息的利用方式。传统的有监督分类算法用基于属性的方法估计从数据到类别的映射函数,只利用了顶点上的信息。虽然核方法引入了数据间的两两关系,但也只利用了数据的条件属性,没有把顶点上的监督信息扩展到边上加以利用。类似地,大多半监督聚类算法用点对约束限制可行解的搜索范围或者学习一个合适的度规,从而使聚类结果尽可能满足事先给定的约束条件,但却鲜少考虑把边上的监督信息的扩展到顶点上加以利用。本文提出将已知的监督信息在图的点边结构上进行传递,从而提高有监督分类和半监督聚类算法对监督信息的有效利用。首先,我们提出了一种统一的非线性分类框架,称为流形映射机,它由监督式流形映射,分类器构建和测试数据扩展三部分构成。该分类框架将顶点间的类别关系融入到边的权重上(点—边),然后有监督地将不同类别的数据在新的低维特征空间中分离开来(边—点),有利于后续分类器的构建,是一种“点—边—点”的途径。为了使测试数据映射到目标空间后能达到类似的效果,我们在数据的原始流形和目标流形之间搭建了一个“桥”,通过最小化测试数据从原始流形和目标流形映射到该中间“桥”上的差异,确定测试数据在目标流形上的最佳映射。此外,我们还讨论了流形映射机与几种著名流形学习算法之间的联系,证明了该框架可行性和广泛性。其次,在流形映射机框架的前提下,我们提出了一种监督式谱空间分类器。该分类器用线性融入监督信息的方式,将输入数据映射到低维的监督式谱空间中。然后,S3C分别采用了三种不同的分类算法用于分类器构建。在测试数据扩展阶段,我们证明了S3C通过构建流形桥所推得的测试数据最佳映射与Nystr m方法具有相同的形式。大量基于人工数据集和真实数据集的实验结果显示,S3C的分类性能显著优于其它多种经典的分类算法。最后,我们提出了一种局部约束传播的半监督聚类算法,可用于处理既有必连约束又有不连约束的多类别半监督聚类问题。该算法先确定每个约束顶点的影响范围(边—点),然后根据每个无约束边所连接的顶点与有约束顶点之间的相似度,将约束边的影响成比例地传播开去(点—边),因而是一种“边—点—边”的途径。我们将每个顶点的传播范围及其影响程度定义为介于细粒度顶点和粗粒度簇之间的一种中间结构,称之为“组件”。通过评估各个组件传播范围的准确程度,算法还可以自适应地调节各个点对约束在不同组件上的传播强度,使得置信度高的组件受到的约束影响较大,而置信度低的组件则受到的约束影响较小。大量基于UCI数据库,文本文档,手写数字,英文字符,人脸识别和图像分割数据集上的实验证明,局部约束传播半监督聚类算法比其它经典的半监督聚类算法更准确,也更高效。

【Abstract】 Graph-based learning has attracted more and more interest from the researcher in recent years. Itnot only has mathematical support from the graph theory, but also indicates the key to improve theperformance of the existing learning algorithms that the connection among data points is as importantas the data points. The supervised information is critical information resource as well as ultimatelearning target for supervised learning and semi-supervised learning in machine learning. In this paper,the supervised information in graph-based learning is partitioned into two categories, including vertexconstraints and edge constraints. The supervised classification and the semi-supervised clustering areviewed as vertex-constrained and edge-constrained learning problems respectively. We will putemphasis on the discussion of the utilization of the supervised information by these two groups oflearning algorithms.Traditional supervised classification algorithms prefer the attribute-based approaches in estimatingthe underlying functions that map the attributes to the class labels, which only extracts the informationabout vertices. Although the kernel method later introduces the pairwise relationship of the data points,it only computes with the conditional attributes, but has not extended the supervised information fromthe vertices to the edges. Similarly, most of the existing semi-supervised clustering algorithms bias thesolution space or learn an appropriate metric to make the clustering result as consistent with theprovided constraints as possible. However, they also fail to extend the superved information from theedges to the vertices for full utilization. This paper proposes to transfer the supervised informationthrough vertices and edges on graph, so that the utilization of the supervised information in these twogroups of learning problems can be improved.First, we propose a unified framework for nonlinear classifier design called Manifold MappingMachine (M3). It is composed of three stages, supervised manifold mapping, classifier constructionand out-of-sample extension. As a ‘vertex–edge-vertex’ approach, S3C integrates the classrelationship of the vertices into the weight of their edges (vertex-edge), and then transforms thedifferent classes of data into the low-dimensional target manifold separately (edge-vertex), whichsimplifies the subsequet classifier construction. In order to achieve a similar effect in the spacetransformation of the test data, we construct a ’’bridge’’ between the original manifold and the targetmanifold. By minimizing the difference of the test data mapped from the two manifolds on theintermediate ’’bridge’’, the optimal mapping of the test data can be determined uniquely. To prove the feasibility and the generalization ability of M3, we also discuss its connections with severalwell-known manifold learning algorithms as well.Second, we present a nonlinear classifier under the framework of M3, named Supervised SpectralSpace Classifier (S3C). It integrates the supervised information linearly to map the input data into thelow-dimensional supervised spectral space. Then S3C adopts three different classification algorithmsfor classifier construction. During the out-of-sample extension state, we prove that the optimalmapping of the test data derived by S3C through the manifold “bridge” has the same form as thatderived from the Nystr m Approximation method. Sufficient experimental results show that S3C issignificantly superior to other state-of-the-art nonlinear classifiers on both synthetic and real-worlddata sets.Last but not least, we propose a semi-supervised clustering algorithm named SCRAWL, short forSemi-supervised Clustering via RAndom WaLk, to deal with multi-class semi-supervised clusteringproblems given both must-link and cannot-link constraints. As an ‘edge-vertex-edge’ approach,SCRAWL first determines the propagation range of each constrained vertex (edge-vertex), followedby the expansion of the constraint influence according to the similarities of the vertices connected bythe unconstrained edges to the constrained vertices (vertex-edge). We define the propagation range ofeach constrained vertex and the degrees of its impact as an intermediate structure between thefine-grained vertex and the coarse-grained cluster, called “component”. By estimating propagationaccuracy of each component, SCRAWL can also adjust the strength of the constraint propagation overthe different components, so that the components with higher confidence can receive greater influencefrom the propagated constraints, while the components with lower confidence can only receive lessinfluence by contrast. The experiments on UCI database, text documents, handwritten digits,alphabetic characters, face recognition and image segmentations demonstrate the effectiveness andefficiency of SCRAWL.

节点文献中: