节点文献
流形学习算法及其应用研究
The Study of Manifold Learning Algorithms and Their Applications
【作者】 雷迎科;
【导师】 黄德双;
【作者基本信息】 中国科学技术大学 , 模式识别与智能系统, 2011, 博士
【摘要】 流形学习方法作为一类新兴的非线性维数约简方法,主要目标是获取高维观测数据的低维紧致表示,探索事物的内在规律和本征结构,已经成为数据挖掘、模式识别和机器学习等领域的研究热点。流形学习方法的非线性本质、几何直观性和计算可行性,使得它在许多标准的toy数据集和实际数据集上都取得了令人满意的结果,然而它们本身还存在着一些普遍性的问题,比如泛化学习问题、监督学习问题和大规模流形学习问题等。因此,本文从流形学习方法存在的问题出发,在算法设计和应用(图像数据与蛋白质相互作用数据)等方面展开了一系列研究工作。首先对流形学习的典型方法做了详细对比分析,然后针对流形的泛化学习和监督学习、表征流形的局部几何结构、构造全局的正则化线性回归模型、大规模数据的流形学习等几个方面进行了重点研究,提出了三种有效的流形学习算法,并和相关研究成果进行了理论与实验上的比较,从而验证了我们所提算法的有效性。全文的主要工作概括如下:(1)在深入研究局部样条嵌入算法(LSE)的基础上,引入明确的线性映射关系,构建平移缩放模型和正交化特征子空间,提出了一种正交局部样条判别投影算法(O-LSDP)。有效解决了原始LSE算法存在的两个主要问题:样本外点学习问题和无监督模式学习问题,从而使该算法能够应用于模式分类问题并显著改善了算法的分类识别能力。在标准人脸数据库上进行实验比较分析,验证了该算法的有效性与可行性。(2)在兼容映射的概念框架下,提出了一种局部多尺度回归嵌入算法(LMDSRE)。LMDSRE算法首先利用局部多维尺度分析(LMDS)构建每个样本点邻域的局部坐标来表示低维流形的局部几何结构,然后拟合正则化的线性回归模型并排列所有的局部等距坐标,从而构建全局唯一的低维坐标。该算法作为一种新的流形学习方法具有局部等距的特点,能够应用于非线性维数约简和数据可视化分析,在六个标准人工数据集和三个实际数据集上的实验结果验证了该方法的有效性。(3)针对ISOMAP算法计算复杂度高的问题,提出了一种快速等距特征映射算法(Fast-ISOMAP)。Fast-ISOMAP算法首先利用最小子集覆盖策略(MSC)从数据集中选择p个Landmark点( p n),从而在构造最短路径距离矩阵时,用p×n距离矩阵D p×n代替了原始的n×n距离矩阵Dn×n,然后运用Landmark MDS算法将所有样本嵌入到低维特征空间。与原始的ISOMAP算法相比,Fast-ISOMAP算法在不显著改变原始ISOMAP算法嵌入性能的条件下,大大提高了算法的计算效率,该算法适合应用于大规模流形学习问题。在标准数据集上的实验结果验证了该算法的有效性。(4)提出了一种鲁棒的基于快速流形嵌入的蛋白质相互作用数据可信度评估与预测新方法。首先通过对蛋白质相互作用数据进行低维流形建模,然后采用快速等距特征映射流形学习方法将蛋白质相互作用数据映射到低维度量空间,从而把蛋白质相互作用数据可信度评估与预测的生物问题转化为低维嵌入空间中数据点之间相似性度量的数学问题,最后根据蛋白质对在低维嵌入空间的相似性度量来构造加权CD-Dist可靠性指数用于评估与预测可信度。在三个由不同高通量实验技术产生的不同规模的酵母蛋白质相互作用数据集上的实验结果表明,基于快速流形嵌入的方法所获得的高可靠性相互作用数据具有更高的功能一致性与细胞组分一致性。据我们所知,本章所提出的方法首次利用了流形学习理论来解决蛋白质相互作用数据可信度的评估与预测问题。该方法有效克服了现有方法需要额外先验信息和对蛋白质相互作用网络稀疏程度敏感的问题,为检测蛋白质相互作用网络中的假阳性与假阴性“噪声”问题提供了一条新的解决途径。
【Abstract】 Manifold learning is a new kind of nonlinear dimensionality reduction method for finding low-dimensional compact representations of high-dimensional observation data and exploring the inherent law and intrinsic structue of data. At present, manifold learning has become a hot issue in the fields of data mining, pattern recognition, machine learning, and other related research topics. These manifold learning methods do yield impressive results on some artificial and real world benchmark data sets due to their nonlinear nature, geometric intuition, and computational feasibility. However, the original manifold learning methods still show some common problems, such as out-of-sample, supervised learning, large-scale manifold learning, and so on. In order to overcome these problems, this dissertation carries out a series of researches on algorithm design and their applications in image and protein-protein interactions (PPI) data. Firstly, classical manifold learning methods were analyzed and compared in detail. Secondly, five problems about manifold learning were mainly investigated, which include out-of-sample, supervised learning, characterization of local manifold geometry, construction of globally regularized linear regression model, and large-scale manifold learning. Finally, in this dissertation we proposed three manifold learning algorithms. Our proposed algorithms were compared with the related researches in theories and experiments. And the results demonstrate the effectiveness of our proposed algorithms.The main work for this dissertation can be summarized as follows:(1)Based on the analyses of local spline embedding (LSE) method, we proposed an efficient feature extraction algorithm called orthogonal local spline discriminant projection (O-LSDP). By introducing an explicit linear mapping, constructing different translation and rescaling models for different classes as well as orthogonalizing feature subspace, O-LSDP can effectively circumvent the two major shortcomings of the original LSE algorithm, i.e., out-of-sample and unsupervised learning. O-LSDP not only inherits the advantages of LSE which uses local tangent space as a representation of the local geometry so as to preserve the local structure, but also makes full use of class information and orthogonal subspace to significantly improve discriminant power. Extensive experiments on standard face databases verify the feasibility and effectiveness of the proposed algorithm. (2)A new manifold learning algorithm called local multidimensional scaling regression embedding (LMDSRE) was developed under the conceptual framework of compatible mapping. LMDSRE is to use local multidimensional scaling (LMDS) to construct the local coordinates of each data point and its neighbors for representing local geometry structure of low-dimensional manifold. Regularized linear regression models are then fitted to map each of the local coordinates to its own single low-dimensional global coordinate. LMDSRE as a new manifold learning method has the characteristic of local isometry and can be used for nonlinear dimensionality reduction and data visualization analysis. The experiments on six toy datasets and three real-world datasets illustrate the validity of our method.(3)For the high complexity problem of the ISOMAP algorithm, we designed a new fast isometric feature mapping (Fast-ISOMAP) method. Fast-ISOMAP is to utilize minimum set cover strategy to designate p ones among all data points to be landmark points. Instead of computing Dn×n, we only computed the p×n matrix D p×n of distances from each data point to the landmark points in constructing the shortest path distance matrix. Landmark MDS is then applied to embed all data points to low-dimensional feature subspace. It was found in experiments that Fast-ISOMAP can greatly improve the computational efficiency of the original ISOMAP and be used in large-scale manifold learning problems under the condition that it does not significantly change the performance of ISOMAP. Experimental results on many artificial benchmark datasets show the effectiveness of our proposed algorithm.(4)We developed a robust computational technique for assessing the reliability of protein interactions and predicting new protein interactions by fast manifold embedding algorithm. Firstly, we adopted low-dimensional manifold modeling to fit a PPI network and utilized our proposed fast isometric feature mapping (Fast-ISOMAP) to transform a PPI network into a low dimensional metric space, which recasts the problem of assessing and predicting protein interactions into the form of measuring similarity between the points located in its metric space. Then a reliability index (RI), a likelihood indicating the interaction of two proteins, is assigned to each protein pair in the PPI networks based on the similarity between the points in the embedding space. The performance of the proposed approach is evaluated by using functional homogeneity and localization coherence of protein interactions from three PPI datasets that are derived from various scales and high-throughput techniques, i.e., yeast-two-hybrid (Y2H), tandem affinity purification (TAP), and mass spectrometry (MS). Experimental results demonstrate that the interactions ranked top by our method have high functional homogeneity and localization coherence. Our proposed method can effectively and efficiently overcome the disadvantages that most existing methods require additional prior information and are sensitive to the sparseness of PPI network. Moreover, to our knowledge, our proposed method is the first research work aiming at utilizing manifold learning theory to assess and predict protein interactions. Therefore, the proposed algorithm is a much more promising method to detect both false positive and false negative interactions in PPI networks.
【Key words】 Manifold learning; Dimensionality reduction; Orthogonal local spline discriminant projection; Local multidimensional scaling regression embedding; Fast isometric feature mapping; Face recognition; Data visualization; Protein-protein interactions; False positives; False negatives;