
Learning Structure Features in High Dimensional Data Based on Natural Nearest Neighbor

【Author】 邹咸林 (Zou Xianlin)

【Supervisor】 朱庆生 (Zhu Qingsheng)

【Author Information】 Chongqing University, Computer Application Technology, 2011, Ph.D.

【Abstract (translated from the Chinese)】 The analysis and study of the structure of large-scale, high-dimensional data has long been one of the most important research topics in data mining, machine learning, and pattern recognition, and one of the difficulties that modern science and technology must confront. The structural features of high-dimensional data mainly comprise clustering structure and manifold structure; the methods used to study them draw on several branches of mathematics, such as multivariate analysis, nonlinear functional analysis, and statistics, and finding simple yet effective methods has been a long-pursued goal. Learning the structure of high-dimensional data is therefore of great value.

This dissertation proposes a new nearest-neighbor concept, the natural nearest neighbor (Natural Nearest Neighbor: 3N). It is a scale-free notion of nearest neighbor whose core idea is that the most outlying data object has only one nearest neighbor (equivalently, the lowest energy), while dense objects have more neighbors (higher energy); moreover, the natural nearest neighbors can be computed automatically, which is the main contribution and innovation of this dissertation. From the natural nearest neighbors, an adaptive nearest-neighborhood graph (a data network) forms automatically; in the language of complex networks it is likewise a scale-free network. This neighborhood graph represents the relations among data objects well and clearly exposes each object's density information. The mechanism by which natural nearest neighbors form effectively reduces the risk of crossing cluster boundaries when searching for neighbors, so the agglomerative state of clustered data is preserved and the infrastructure of the clusters emerges naturally, providing a sparse, parameter-free base graph model for subsequent clustering and manifold-structure analysis. Applying this adaptive neighborhood graph to manifold learning algorithms such as Isomap, LLE, HLLE, and LTSA yields corresponding adaptive variants with no free parameters and avoids the neighborhood-parameter selection problem of traditional manifold learning, because the neighborhood graph formed by natural nearest neighbors approximates the low-dimensional embedded data manifold well. This offers a new perspective on dimensionality reduction, one of the two hot topics in machine learning. The dissertation also studies the relation between the feature-subspace structure of outlying objects and that of clusters, and provides a neighbor-based view of outlier behavior.

The main innovations and contributions are as follows.
(1) The natural nearest neighbor (3N) is proposed, together with a very simple computation method. Its soundness is verified on standard distributions (uniform and normal) and on regular data sets. Compared with k-nearest and ε-nearest neighbors it carries richer information, such as density, outlierness, and structure approximation. The distribution (histogram) of the numbers of natural nearest neighbors reveals the distribution of the data set, independently of its dimensionality.
(2) The adaptive neighborhood graph is applied to representative manifold learning algorithms: the global algorithm Isomap and the local algorithms LLE, HLLE, and LTSA, yielding the adaptive algorithms 3N-Isomap, 3N-LLE, and 3N-LTSA with no free parameters and resolving the decade-old problem of free-parameter selection in manifold learning, so that anyone can apply these algorithms to data in their own field without that burden. 3N-Isomap is applied to three practical problems, and an automatic multi-manifold learning algorithm, a large-scale manifold learning algorithm (implemented via a generic simple random-sampling scheme), and a method for analyzing the complexity of homogeneous data are proposed.
(3) The adaptive neighborhood graph is applied to spectral graph clustering algorithms such as MNcut, giving an improved algorithm, 3N-MNcut, whose performance exceeds that of the original.
(4) It is shown that the feature-subspace structures of outliers and of clusters share the same dependence on the largest eigenvalues and the corresponding eigen-subspaces. A new outlier index is proposed for observing the dynamic behavior of outliers.
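The abstract describes the 3N search only at the level of ideas: the most outlying object ends up with a single neighbor, dense objects accumulate more, and the process terminates automatically. The round-based sketch below is one plausible reading of that description, not the dissertation's actual algorithm; the function name and the termination rule (stop once every point has been chosen as a neighbor by someone) are our assumptions.

```python
import numpy as np

def natural_neighbor_search(X):
    """Sketch of a round-based natural-nearest-neighbor (3N) search.

    In round r, every point adopts its r-th nearest neighbor; the search
    stops once each point has been adopted by at least one other point.
    Outliers are then adopted late and rarely, dense points early and often.
    """
    n = len(X)
    # Pairwise distances, sorted once; column 0 of `order` is the point itself.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(D, axis=1)
    reverse_count = np.zeros(n, dtype=int)   # how often each point was adopted
    neighbors = [set() for _ in range(n)]
    r = 0
    while np.any(reverse_count == 0) and r < n - 1:
        r += 1
        for i in range(n):
            j = order[i, r]                  # r-th nearest neighbor of i
            neighbors[i].add(j)
            reverse_count[j] += 1
    return neighbors, reverse_count, r       # r plays the role of the scale
```

A histogram of `reverse_count` (or of the sizes of the `neighbors` sets) is one way to produce the neighbor-number distribution the abstract suggests inspecting, and it is indeed independent of the ambient dimension.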

【Abstract】 Analyzing and investigating the structure of large-scale, high-dimensional data sets has become a very important task in data mining, machine learning, and pattern recognition, and remains a difficult problem for modern science and technology. The structural features contained in high-dimensional data usually appear as clusterings and manifolds; the methods used to reveal them involve many branches of mathematics, such as multivariate analysis, functional analysis, and statistics, but simple and efficient approaches have received special attention from researchers. In this dissertation a novel nearest-neighbor concept, the natural nearest neighbor (3N neighbor), is proposed; in contrast to k-nn and ε-nn, it is scale-free. The key idea is that the farthest outliers in a data set take only one point as a nearest neighbor, whereas dense points have more nearest neighbors, and these neighbors can be computed for all points automatically, which is the highlight of the dissertation. An adaptive nearest-neighborhood graph (ANNG), or data network, induced by connecting each data point to its 3N neighbors can likewise be constructed automatically for any data set. In the language of complex networks it is a scale-free network closely associated with the whole data set: it presents the natural relationships among data points, and the density information is clearly displayed and easily obtained.
Owing to the mechanism by which 3N neighbors form, the risk of crossing class boundaries while searching for neighbors is significantly reduced; the agglomerative state of the original data set is preserved in the corresponding ANNG, which maintains an infrastructure with the primary form of the clustering and provides a sparse and more efficient graph model for the subsequent steps of learning the clustering or manifold structure. On the other hand, applying the ANNG to classic manifold learning algorithms such as Isomap, LLE, HLLE, and LTSA, extended adaptive versions of these algorithms are easily formulated; more importantly, the long-standing problem of selecting the optimal neighborhood size in manifold learning is avoided in practice. This provides a new point of view on dimensionality reduction, one of the two hot issues in machine learning. Additionally, two subspace structures, the outlying-feature subspace and the clustering-feature subspace, are discussed from a unified clustering point of view, and a new outlier index is defined to probe the behavior of outliers. The main innovations and contributions are as follows.
(1) The natural nearest neighbor (3N), a novel concept, is proposed and implemented in a very simple way. Its soundness is validated on data sets with standard distributions (uniform and normal) and on regular data derived from special mechanisms, such as grid data. The distribution of the data can be observed from the histogram of the numbers of 3N neighbors, which is independent of the dimension. The 3N neighbor carries more meaningful information than the traditional k-nn or ε-nn.
(2) Applying the 3N concept to the global manifold learning algorithm Isomap and to the local manifold learning algorithms LLE, HLLE, and LTSA yields the extended algorithms 3N-Isomap, 3N-LLE, and 3N-LTSA, which require no free parameter. These algorithms can therefore be used anywhere, at any time, without considering what value a free parameter should take, for tasks such as analyzing the complexity of homogeneous high-dimensional data subsets, automatically learning multiple manifolds or clusterings, and large-scale manifold learning.
(3) An improved spectral graph clustering algorithm, 3N-MNcut, is obtained by applying the ANNG to the original MNcut algorithm, since the ANNG induces a well-formed similarity matrix that is sparse and preserves the local manifold structure.
(4) The feature-subspace structures of outliers and of clusters are closely interrelated and both correspond to the largest eigenvalues. A new outlier index is proposed for observing the behavior of outliers via neighbors at different distances.
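Once 3N neighbor sets are available, the ANNG described above can stand in for the fixed k-NN graph in Isomap's geodesic-distance step, which is the substitution a 3N-Isomap variant would make. The sketch below is purely illustrative: it assumes the neighbor sets are given as input, treats the graph as undirected (an edge exists if either endpoint lists the other), and uses names of our own choosing, not the dissertation's.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def anng_geodesics(X, neighbor_sets):
    """Build an adaptive nearest-neighborhood graph (ANNG) from precomputed
    3N neighbor sets and return all-pairs geodesic (graph shortest-path)
    distances, i.e. the graph step a 3N-Isomap variant would substitute
    for the fixed k-NN graph."""
    n = len(X)
    rows, cols, vals = [], [], []
    for i, nbrs in enumerate(neighbor_sets):
        for j in nbrs:
            rows.append(i)
            cols.append(j)
            vals.append(float(np.linalg.norm(X[i] - X[j])))
    W = csr_matrix((vals, (rows, cols)), shape=(n, n))
    # directed=False symmetrizes the graph: edges are traversable both ways.
    return shortest_path(W, method="D", directed=False)
```

The resulting geodesic-distance matrix would then be fed to classical MDS, exactly as in standard Isomap; the only change is that the neighborhood graph requires no k or ε.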

  • 【Online Publisher】 Chongqing University
  • 【Online Publication Year/Issue】 2012, No. 07
  • 【CLC Number】 TP311.12-4
  • 【Citations】 4
  • 【Downloads】 278