Multivariate Data Analysis Methods Based on Feature Selection and Their Applications in Spectroscopic Study

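【Author】 Zhang Mingjin (张明锦)

【Supervisor】 Du Yiping (杜一平)

【Author Information】 East China University of Science and Technology (华东理工大学), Analytical Chemistry, 2011, PhD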

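【Abstract (from the Chinese)】 Feature selection is an important research topic in multivariate data analysis. It removes irrelevant and redundant information, reduces data dimensionality and algorithmic complexity, and improves the generalization ability and interpretability of models, so it plays a major role in data analysis. Taking proteomic mass spectrometry data and near-infrared (NIR) spectral data as research objects, this dissertation studies feature (variable) selection methods for high-dimensional data. The proteomic mass spectrometry data are analyzed to search for potential biomarkers and to perform pattern recognition of diseased and healthy samples; the NIR spectral data are studied to eliminate the influence of collinearity through variable selection and thereby build stable, efficient multivariate calibration models. The work mainly covers the following aspects:

(1) An evolutionary feature selection method based on uncorrelated linear discriminant analysis (ULDA) is proposed. It comprises four steps: data denoising and normalization, data binning and bin-variable selection, bin data processing, and ULDA for feature selection and sample classification. Analysis of SELDI-TOF mass spectrometry data from ovarian cancer serum samples yielded potential biomarkers that can identify ovarian cancer samples, and the resulting classification model achieved 100% sensitivity and specificity.

(2) A feature selection method combining independent component analysis with ULDA is proposed. It has three steps: 1) independent component decomposition; 2) a nonparametric statistical test to select discriminative independent components; 3) ULDA to select potential biomarkers and build the classification model. The method was applied to a colorectal cancer dataset and an ovarian cancer dataset. The classification models built from the finally selected features reached 100% sensitivity on both datasets, with specificities of 100% and 96.77%, respectively.

(3) A feature selection method based on the F-score and partial least squares discriminant analysis (PLS-DA) is established. Peaks are first extracted from the mass spectral signals by preprocessing; the variables are then ranked by F-score according to their discriminative power; finally, potential biomarkers are selected stepwise with replacement using PLS-DA. On the colorectal and ovarian cancer datasets, the final specificities were 100% and 96.77% and the sensitivities were 95.24% and 100%, respectively.

(4) A recursive partial least squares method based on Monte Carlo sampling is proposed. Monte Carlo sampling generates multiple data subsets, PLS models are built repeatedly on each subset, several good variable subsets are chosen using the regression coefficients as the selection criterion, and the final optimal variable set is determined by statistical analysis. The method was applied to several NIR datasets and compared with other methods; the results show that it selects variables from NIR spectra effectively.

(5) A variable selection method based on spectral purity values is proposed for wavelength selection in quantitative NIR modeling. The purity value of each spectral variable is computed, the variables are sorted in descending order of purity, and the best variables are selected step by step by examining each variable's contribution to the model with PLS cross-validation. Applied to several NIR datasets, the method proved simple and effective.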
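The F-score ranking step in method (3) can be illustrated with a small sketch. The abstract does not state which F-score formula was used, so the code below assumes the widely used two-class definition (squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances). The function names `f_score` and `rank_by_f_score` and the toy data are hypothetical and are not taken from the dissertation.

```python
from statistics import mean, variance


def f_score(pos, neg):
    """Two-class F-score of one variable (assumed definition, see lead-in).

    pos, neg: lists of values of this variable in the two classes.
    Each class needs at least two samples for the sample variance.
    """
    m_all = mean(pos + neg)
    between = (mean(pos) - m_all) ** 2 + (mean(neg) - m_all) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_by_f_score(X, y):
    """Return (column indices sorted by decreasing F-score, list of scores).

    X: list of rows (samples x variables); y: class labels, 1 or 0.
    """
    n_vars = len(X[0])
    scores = []
    for j in range(n_vars):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(f_score(pos, neg))
    order = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (made up): variable 0 separates the classes, variable 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0],
     [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = [1, 1, 1, 0, 0, 0]
order, scores = rank_by_f_score(X, y)
print(order, scores)
```

In a pipeline like the one described, the ranked variables would then be passed one at a time to a classifier such as PLS-DA, which is not shown here.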

【Abstract】 Feature selection is one of the most important aspects of multivariate data analysis. It eliminates redundant and irrelevant information and reduces data dimensionality, which simplifies computation, and it improves the generalization performance and interpretability of models. Feature selection therefore plays an important role in data analysis.

This dissertation studies feature selection methods for high-dimensional data, taking proteomic mass spectrometric (MS) data and near-infrared spectroscopic (NIR) data as the research objects. For proteomic MS data, the aims were potential biomarker discovery and sample classification; for NIR data, the aim was wavelength selection to eliminate the effect of collinearity and to build effective models.

The main work of this dissertation is as follows:

(1) A feature selection method called ULDA-HFS (uncorrelated linear discriminant analysis based heuristic feature selection) was proposed. It mainly includes three steps: (a) dimensionality reduction and data normalization; (b) data binning and discriminant bin selection; (c) ULDA for feature selection and sample classification. An ovarian cancer serum SELDI-TOF (surface-enhanced laser desorption/ionization time-of-flight) MS dataset was analyzed with the proposed method, which yielded several potential biomarkers that discriminate ovarian cancer samples from healthy samples. The classification model built from these potential biomarkers achieved 100% specificity and sensitivity.

(2) A strategy based on independent component analysis (ICA) and ULDA was proposed for proteomic profile analysis and potential biomarker discovery from proteomic mass spectra of cancer and control samples. The method mainly includes three steps: (a) ICA decomposition of the mass spectra; (b) selection of discriminatory independent components (ICs) using a nonparametric test; and (c) selection of specific peaks (m/z locations) as potential biomarkers and construction of classification models by ULDA. A colorectal cancer dataset and an ovarian cancer dataset were analyzed with the proposed method. The classification results yielded specificities of 100% and 96.77% on the colorectal and ovarian cancer datasets, respectively, and a sensitivity of 100% on both datasets.

(3) A feature selection method based on the F-score and partial least squares discriminant analysis (PLS-DA) was presented. After preprocessing, the peaks in the signals were picked and the variables were sorted according to their F-scores; potential biomarkers were then selected by performing PLS-DA in a forward selection strategy. The potential biomarkers selected by the proposed method yielded 100% specificity and 95.24% sensitivity on a colorectal cancer dataset, and 96.77% specificity and 100% sensitivity on an ovarian cancer dataset.

(4) A feature selection method named Monte Carlo Sampling-based Recursive Partial Least Squares (MCS-RPLS) was proposed. It first creates a number of sub-datasets using the Monte Carlo sampling technique, then models each subset repeatedly with PLS and selects a feature subset on each sub-dataset using the regression coefficients as the criterion, and finally determines the optimum feature set through statistical analysis of the feature subsets. The method was applied to several NIR datasets and compared with several other methods; the results showed that it can effectively select useful features from NIR data for multivariate calibration.

(5) A feature selection method based on the purity of spectral variables was proposed and used for wavelength selection from NIR datasets for quantitative modeling. After the purity of each spectral variable (i.e. wavelength) is calculated, the variables are sorted in descending order of purity and the optimum variables are selected step by step, with the contribution of each variable to the calibration model tested by PLS cross-validation. The method was applied to several NIR datasets, and the results indicated its simplicity and effectiveness.
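The general idea behind method (4), repeated Monte Carlo subsampling followed by ranking variables by the magnitude of their PLS regression coefficients, can be sketched as below. This is not the MCS-RPLS algorithm itself: the abstract gives no algorithmic details, so the sketch assumes a one-component PLS1 model, a single ranking pass per subset (no recursion), and a simple selection-frequency vote as the "statistical analysis". All function names, parameters (`n_runs`, `sample_frac`, `keep`, `min_freq`) and the synthetic data are hypothetical.

```python
import random


def pls1_coefficients(X, y):
    """Regression coefficients of a one-component PLS1 model (centered data)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector w is proportional to X^T y, scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0:
        return [0.0] * p
    w = [v / norm for v in w]
    # Scores t = X w; inner coefficient q = (t^T y) / (t^T t).
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return [wj * q for wj in w]


def mc_pls_select(X, y, n_runs=100, sample_frac=0.8, keep=2,
                  min_freq=0.5, seed=0):
    """Assumed voting scheme: keep variables that rank in the top `keep`
    by |coefficient| in at least `min_freq` of the Monte Carlo runs."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    m = max(2, int(round(sample_frac * n)))
    counts = [0] * p
    for _ in range(n_runs):
        idx = rng.sample(range(n), m)  # random subset without replacement
        b = pls1_coefficients([X[i] for i in idx], [y[i] for i in idx])
        top = sorted(range(p), key=lambda j: abs(b[j]), reverse=True)[:keep]
        for j in top:
            counts[j] += 1
    freq = [c / n_runs for c in counts]
    selected = [j for j in range(p) if freq[j] >= min_freq]
    return selected, freq


# Synthetic data (made up): only variables 0 and 3 influence y.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(6)] for _ in range(200)]
y = [3 * row[0] - 3 * row[3] + data_rng.gauss(0, 0.05) for row in X]
selected, freq = mc_pls_select(X, y)
print(selected, freq)
```

A practical implementation would normally use a multi-component PLS from an established library and choose the number of components and retained variables by cross-validation; the one-component model is used here only to keep the sketch self-contained.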
