MDL-Based Feature Selection for High-Dimensional Data

【Author】 刘峤

【Supervisor】 秦志光

【Author Information】 University of Electronic Science and Technology of China, Information Security, 2010, Doctoral Dissertation

【Abstract】 Feature selection for high-dimensional data, also known as the sparse modeling problem, is one of the most active topics in current machine learning research. It aims to address the failure of most existing feature modeling methods in high-dimensional feature spaces. The dominant methodology is regularization, in particular L1-penalized and L0-penalized methods. Prevalent L1-norm approaches, however, share inherent theoretical drawbacks: they lack the ability to select grouped correlated features, and they cannot select more features than the sample size. Traditional L0-penalized methods, in turn, are prone to overfitting in sparse modeling practice, mainly because their constraints on model complexity are inadequate. Recent theoretical work has shown that stepwise regression with L0 regularization can outperform L1 algorithms in sparse settings. Starting from the minimum description length (MDL) principle, this thesis therefore derives three novel L0-penalized feature selection models for high-dimensional data:

1. A stochastic-complexity-constrained method. By imposing a Gaussian assumption on the model parameters and solving the Fisher information approximation to the lower bound of the stochastic complexity, an easily computable feature evaluation criterion is obtained, and a stepwise regression procedure for sparse modeling is built on it. Experiments on synthetic data and real open microarray datasets show that it outperforms prevalent L1-penalized methods and the latest related results reported in the literature.

2. A biased-risk-constrained method. By introducing an L2-norm penalty on the model parameters into the risk inflation criterion (RIC), the RIC model is extended from low-dimensional to high-dimensional feature spaces, and a stepwise regression procedure for choosing between nested parametric models is built on the resulting biased RIC. Simulation and gene selection experiments likewise demonstrate its superiority over the mainstream alternatives above.

3. A biased-MDL method, intended to combine the strengths of the two methods above into a more general solution to the sparse modeling problem. By adding a Tikhonov-type regularization term to the stochastic complexity model, it relaxes the theoretical restrictions of pure MDL approaches. Experiments on synthetic datasets, real microarray datasets, and real image spam datasets show that it handles sparse modeling tasks efficiently, outperforms prevalent L1-penalized methods and other alternatives in all tests, and performs best among the three proposed models.

These theoretical and experimental results provide new and strong evidence that L0-regularized stepwise regression is not only applicable in high-dimensional feature spaces but can also achieve better sparse modeling performance than L1 methods. It is the author's hope that the proposed models offer a new perspective on high-dimensional feature selection and encourage further investigation along these lines.
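The core device the abstract describes, greedy stepwise regression driven by a description-length score under an L0 constraint, can be illustrated with a short sketch. The code below is a minimal illustration rather than the thesis's actual method: it scores candidate feature sets with the classical two-part MDL approximation (n/2)·log(RSS/n) + (k/2)·log n, and optionally with a risk-inflation-style penalty k·log p (after Foster and George), whereas the thesis derives refined Fisher-information and biased criteria. All function names and the toy data are illustrative.

```python
import numpy as np

def description_length(X, y, support, criterion="mdl"):
    """Code length of the least-squares model on the given feature support.

    "mdl": classical two-part MDL approximation (n/2)log(RSS/n) + (k/2)log(n).
    "ric": risk-inflation-style penalty k*log(p) (after Foster & George).
    Both are standard stand-ins, not the thesis's derived criteria.
    """
    n, p = X.shape
    k = len(support)
    if k == 0:
        rss = float(y @ y)
    else:
        beta, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        resid = y - X[:, support] @ beta
        rss = float(resid @ resid)
    rss = max(rss, 1e-12)                    # guard against a perfect fit
    fit = 0.5 * n * np.log(rss / n)
    penalty = 0.5 * k * np.log(n) if criterion == "mdl" else k * np.log(p)
    return fit + penalty

def forward_stepwise(X, y, max_features=20, criterion="mdl"):
    """Greedy L0-constrained selection: repeatedly add the feature that most
    shortens the description length; stop when no addition improves it."""
    p = X.shape[1]
    support = []
    best = description_length(X, y, support, criterion)
    while len(support) < max_features:
        scores = [(description_length(X, y, support + [j], criterion), j)
                  for j in range(p) if j not in support]
        score, j = min(scores)
        if score >= best:                    # no candidate helps: stop
            break
        support.append(j)
        best = score
    return sorted(support)

# Toy demonstration: 5 relevant features among p = 200, with n = 100 samples.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 200))
beta = np.zeros(200)
beta[:5] = 3.0
y = X @ beta + rng.standard_normal(100)
print(forward_stepwise(X, y))                # ideally recovers [0, 1, 2, 3, 4]
```

With n = 100 samples and p = 200 candidate features this runs in seconds; a production implementation would update QR factorizations incrementally instead of refitting at every step. The abstract's central claim is precisely that, with a well-chosen complexity penalty, such a greedy L0 procedure remains sound when p exceeds n, the regime where L1 methods hit the sample-size ceiling.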
