
组稀疏子空间的大间隔特征选择

Large Margin of Group Sparse Subspace for Feature Selection

【Author】 Liu Bo (刘波)

【Supervisor】 Fang Bin (房斌)

【Author Information】 Chongqing University, Computer Application Technology, 2013, Doctoral dissertation

【Abstract (translated from the Chinese)】 With the rapid development of information technology and biotechnology, large volumes of high-dimensional data are being produced. Feature selection, an effective method for analyzing high-dimensional data, has attracted increasing attention from researchers. Feature selection means choosing, from a given feature set, a subset that represents the original data well. The quality of feature selection on high-dimensional data directly affects the final result of subsequent learning algorithms, so feature selection for high-dimensional data is an important topic in machine learning research. This dissertation takes high-dimensional data as its research object and studies feature selection methods for practical problems in this area. The main contents and contributions are as follows:

① A locally linear method is adopted to represent the structure of nonlinear high-dimensional data effectively. Any complex problem can be decomposed into several locally linear problems. Accordingly, this dissertation uses each sample's nearest neighbor of the same class and nearest neighbor of a different class to build a locally linear approximation of the nonlinear high-dimensional data structure. The method is simple, efficient, and intuitive.

② A large-margin feature selection model on a group sparse subspace (GSLM) is established. The distance from a sample to its nearest different-class neighbor and the distance to its nearest same-class neighbor are projected into the subspace; their difference is taken as the sample's margin in the subspace, and summing these margins gives the margin over all samples. Maximizing this margin makes the projected samples as far as possible from their nearest different-class neighbors and as close as possible to their nearest same-class neighbors, so the nearest-neighbor information is preserved in the subspace and effective features are selected. To interpret the features selected by the subspace more reasonably, to strengthen the model's robustness to noise, and to avoid distorting the samples' existing probability distribution, the L2,1 norm is introduced into the objective function as a regularization term. An efficient algorithm is proposed for this objective function; it obtains a locally optimal solution. Experiments verify that the model selects good features and that the solving algorithm is highly efficient.

③ A Trace Ratio large-margin feature selection model on a group sparse subspace (TR-GSLM) is proposed. The quotient of a sample's distance to its nearest different-class neighbor divided by its distance to its nearest same-class neighbor is taken as the margin. The larger this margin, the larger the distance to the nearest different-class neighbor and the smaller the distance to the nearest same-class neighbor, so maximizing this margin also preserves the nearest-neighbor information in the subspace and selects effective features. For the non-convex objective function, a new iterative algorithm is proposed that obtains the globally optimal solution, and a convergence proof is given. Experiments verify that the features selected by this model are better than those of GSLM, although the solving algorithm is less efficient.

④ An enhanced Trace Ratio large-margin feature selection model on a group sparse subspace (ETR-GSLM) is proposed. Variable substitution is used to improve the efficiency of the TR-GSLM algorithm; the resulting objective function and a corresponding solving algorithm are given. Although the objective function is extremely complex and non-convex, the algorithm still obtains the globally optimal solution. Because the solving process requires the sample margin matrix to remain positive definite, a modified Cholesky algorithm is adopted to guarantee positive definiteness, and the convergence of the algorithm is proved. Experiments verify that the features selected by this model are better than those of the previous two models, and that the algorithm is more efficient than TR-GSLM.

Extensive experiments verify that the three proposed algorithms achieve better classification accuracy than comparable algorithms and are insensitive to the kernel function parameter and the regularization parameter. In running time, the proposed GSLM algorithm has an advantage.
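The sum-of-differences margin and L2,1 penalty described in contribution ② can be sketched in a few lines of Python. This is a minimal illustration only, not the dissertation's implementation: the function names, the fixed projection matrix `W`, and the regularization weight `lam` are assumptions, and no optimization over `W` is performed.

```python
import numpy as np

def nearest_hit_miss(X, y, i):
    """Indices of sample i's nearest same-label (hit) and nearest
    different-label (miss) neighbors, by Euclidean distance."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the sample itself
    same = np.where(y == y[i])[0]
    diff = np.where(y != y[i])[0]
    return same[np.argmin(d[same])], diff[np.argmin(d[diff])]

def gslm_objective(X, y, W, lam=0.1):
    """Subspace margin (sum over samples of miss-distance minus
    hit-distance after projecting by W) minus an L2,1 penalty on W.
    Larger is better: projected samples sit far from their nearest
    miss and close to their nearest hit, with row-sparse W."""
    margin = 0.0
    for i in range(len(X)):
        hit, miss = nearest_hit_miss(X, y, i)
        margin += (np.linalg.norm(W.T @ (X[i] - X[miss]))
                   - np.linalg.norm(W.T @ (X[i] - X[hit])))
    l21 = np.sum(np.linalg.norm(W, axis=1))  # sum of row norms
    return margin - lam * l21
```

On well-separated classes the margin term is positive, and a projection that keeps only the discriminative feature rows pays a smaller L2,1 penalty, which is the mechanism the model exploits for feature selection.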

【Abstract】 With the rapid development of information technology, biotechnology, and Internet technology, high-dimensional data is becoming increasingly common. Feature selection is an effective method for analyzing high-dimensional data, and it has attracted more and more researchers. Feature selection is to find, in a given feature set, a subset that is a good representation of the original data. Subsequent learning algorithms are directly affected by feature selection on high-dimensional data. In this dissertation, we investigate feature selection methods and apply them to high-dimensional data. The main content and contributions include:

① We use a locally linear method to represent the nonlinear high-dimensional data structure. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones. We take each sample's nearest neighbor with the same label and nearest neighbor with a different label to approximate the nonlinear high-dimensional data structure. The method is simple, efficient, and intuitive.

② We establish a large margin model of group sparse subspace for feature selection (GSLM). Each sample's local nearest neighbors are projected into the subspace, where the margin is defined: the distance to the nearest neighbor with a different label and the distance to the nearest neighbor with the same label are projected into the subspace, their difference is taken as the sample's margin, and the sum of these margins is the margin over all samples. Maximizing the margin makes the distance to the different-label nearest neighbor as large as possible and the distance to the same-label nearest neighbor as small as possible, preserving the nearest-neighbor information so that useful features are selected. To interpret the selected features more reasonably and to make the model robust, we incorporate the L2,1 norm into the objective function. We propose an efficient algorithm to solve the objective function; it obtains a locally optimal solution. Comprehensive experiments verify that the model selects good features and the corresponding solving algorithm is very efficient.

③ We establish a large margin model of Trace Ratio-group sparse subspace for feature selection (TR-GSLM). This method takes the distance to the nearest neighbor with a different label divided by the distance to the nearest neighbor with the same label as the margin; the larger the margin, the larger the distance to the different-label nearest neighbor and the smaller the distance to the same-label nearest neighbor. We use the trace ratio to obtain the objective function and then incorporate the L2,1 norm into it. For the non-convex objective function, we propose a novel algorithm that obtains the globally optimal solution, and we prove its convergence. Comprehensive experiments verify that the model is better than GSLM for feature selection, although its algorithm is less efficient.

④ We establish an enhanced large margin model of Trace Ratio-group sparse subspace for feature selection (ETR-GSLM).
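Trace-ratio objectives like the one in contribution ③ are commonly optimized by an iterative eigendecomposition scheme. The sketch below is the generic trace-ratio iteration, not the dissertation's exact TR-GSLM solver; `A` and `B` stand for the different-label and same-label margin matrices, both symmetric with `B` positive definite, and these names are assumptions.

```python
import numpy as np

def trace_ratio(A, B, k, iters=50):
    """Maximize tr(W^T A W) / tr(W^T B W) over W with k orthonormal
    columns: alternately fix the ratio lam and take the top-k
    eigenvectors of A - lam * B, then update lam with the new W."""
    lam = 0.0
    for _ in range(iters):
        vals, vecs = np.linalg.eigh(A - lam * B)
        W = vecs[:, np.argsort(vals)[::-1][:k]]  # top-k eigenvectors
        lam = np.trace(W.T @ A @ W) / np.trace(W.T @ B @ W)
    return W, lam
```

Each update can only increase the ratio, which is why iterations of this form reach the global optimum of the trace-ratio problem rather than a local one.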
We use variable substitution to improve the efficiency of the TR-GSLM algorithm. Although the resulting objective function is extremely complex and non-convex, the algorithm can still obtain the globally optimal solution. To guarantee the positive definiteness of the sample margin matrix, we present a new correction method for indefinite matrices based on a modified Cholesky algorithm; when the margin matrix is positive definite, the algorithm converges. Comprehensive experiments compare the performance of the proposed algorithms with five state-of-the-art algorithms: RFS, SPFS, mRMR, TR, and LLFS. Our algorithms achieve better performance than all five and are not sensitive to the kernel function parameters or the regularization parameters. Compared with LLFS and TR, the proposed GSLM algorithm achieves competitive performance with significantly faster computation.
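The positive-definiteness repair required in contribution ④ can be illustrated with a simple diagonal-loading loop. This is a common practical stand-in for a modified Cholesky correction, not the dissertation's algorithm; the jitter schedule and function name are assumptions.

```python
import numpy as np

def make_positive_definite(A, jitter=1e-10, max_tries=20):
    """Return a symmetric positive definite version of A by adding
    an increasing multiple of the identity to the diagonal until a
    Cholesky factorization succeeds."""
    A = (A + A.T) / 2.0  # symmetrize first
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(A)  # succeeds iff A is positive definite
            return A
        except np.linalg.LinAlgError:
            A = A + jitter * np.eye(A.shape[0])
            jitter *= 10.0
    raise np.linalg.LinAlgError("could not make matrix positive definite")
```

A true modified Cholesky (e.g. Gill-Murray style) perturbs individual pivots during the factorization instead of loading the whole diagonal, which disturbs the matrix less; the loop above only conveys the idea of enforcing positive definiteness before the solver proceeds.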

  • 【Online Publication Submitter】 Chongqing University
  • 【Online Publication Year/Issue】 2014, Issue 02