Studies on Methods of Consensus Data Modeling (多模型共识数据建模方法研究)

【Author】 Su Zhenqiang (苏振强)

【Advisor】 Cai Wensheng (蔡文生)

【Author information】 University of Science and Technology of China, Analytical Chemistry, 2006, PhD

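【Abstract (translated from the Chinese)】 Modeling of analytical chemistry data is an important part of chemometrics research. Depending on the task, data modeling can be divided into regression (calibration) and pattern recognition. Traditional single-model methods are sensitive to both noise and sample size, so when they are applied to complex chemical measurement data, noise or a limited number of samples can greatly reduce the generalization performance of the model. To compensate for these shortcomings, multi-model consensus modeling (ensemble modeling or consensus modeling) has received wide attention in recent years and has been widely studied and applied in many research fields. This dissertation applies consensus modeling to the modeling and classification of near-infrared (NIR) spectra and gene chip (microarray) data, and discusses the basic theory and applications of consensus modeling. The main contents are:
1. The basic principles of analytical chemistry data modeling and the common modeling methods are reviewed, with emphasis on the basic theory, common methods, and current applications of consensus modeling.
2. Consensus modeling of multiple regression models by random sampling is studied, and cPLS, a consensus algorithm for multiple regression models based on partial least squares (PLS), is proposed. Rather than predicting unknown samples with only the single best-performing model, the method perturbs the training set by random sampling, builds a series of PLS models, and selects a subset with better predictive performance to predict unknown samples jointly. Calibration analysis of NIR spectral data of corn shows that cPLS predicts better than an ordinary PLS model: the consensus of multiple PLS models improves both the prediction accuracy and the generalization performance of PLS.
3. Local modeling is combined with consensus modeling to give CDL-PLS, a consensus algorithm based on dynamic local modeling. Unlike ordinary PLS and PLS algorithms based on bagging or boosting, CDL-PLS trains its member PLS models by a local dynamic modeling method: the samples used to train each member are not drawn at random from the original training set but are selected according to the Euclidean distance, in principal component space, between the training samples and the unknown sample to be predicted. Calibration analysis of NIR spectral data of tobacco leaf samples shows that local dynamic modeling improves the prediction accuracy and stability of PLS models, and that the consensus of multiple local dynamic PLS models further improves prediction accuracy and generalization performance.
4. Feature selection is combined with the use of non-repetitive feature variables to build CAMCUN (consensus analysis of multiple classifiers using non-repetitive variables), a consensus classification method based on multiple classifiers. CAMCUN selectively builds member classifiers on non-repetitive feature variables according to the predictive ability of the features, so that the member classifiers are as uncorrelated as possible and the diversity of the members increases. Analysis of gene expression profile data shows that the prediction accuracy and generalization performance of CAMCUN are considerably better than those of its member classifiers. The chance correlation and the prediction confidence of CAMCUN were also evaluated; the results show that the consensus of multiple classifiers lowers the chance correlation and raises the prediction confidence.
5. Feature selection in pattern recognition is studied, and a feature selection method that combines disjoint principal component analysis (disjoint PCA) with a genetic algorithm (GA) is proposed and applied to the identification of differentially expressed genes in gene expression profile data. Disjoint PCA is used to evaluate how well different gene combinations discriminate between two classes of samples; because it accounts for the combined effect of genes, it better reflects how genes actually exert their regulatory functions in an organism. The GA is used to optimize the gene combinations. In addition, a new statistical method is proposed to assess the chance correlation of the differentially expressed genes. The results show that, compared with the t-test and SAM (significance analysis of microarray), which are commonly used in the literature to identify differentially expressed genes, the new method identifies differentially expressed genes with stronger discriminatory power.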
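
The cPLS procedure in item 2 (perturb the training set by random sampling, build a series of PLS models, keep the better ones, and let them predict jointly) can be illustrated with a minimal sketch. The abstract does not give the selection criterion, the number of models, or the combination rule, so everything specific here is an assumption made for illustration: the function names, the 70/30 random split, ranking models by validation RMSE, keeping the best 15 of 50, averaging the kept predictions, and the synthetic data standing in for spectra.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model by successive deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)        # weight vector
        t = Xr @ w                       # scores
        tt = t @ t
        p = Xr.T @ t / tt                # X loadings
        c = (yr @ t) / tt                # y loading
        Xr = Xr - np.outer(t, p)         # deflate X
        yr = yr - c * t                  # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in X space
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (X - x_mean) @ coef + y_mean

def cpls_fit(X, y, n_models=50, n_keep=15, n_components=3, train_frac=0.7, seed=0):
    """Build PLS models on random subsets; keep those with the lowest validation RMSE.

    The split, the RMSE criterion, and n_keep are assumptions, not the thesis's settings.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    scored = []
    for _ in range(n_models):
        idx = rng.permutation(n)
        tr, va = idx[:n_train], idx[n_train:]
        model = pls1_fit(X[tr], y[tr], n_components)
        err = pls1_predict(model, X[va]) - y[va]
        scored.append((float(np.sqrt(np.mean(err ** 2))), model))
    scored.sort(key=lambda pair: pair[0])
    return [model for _, model in scored[:n_keep]]

def cpls_predict(models, X):
    """Consensus prediction: plain average over the kept models (assumed rule)."""
    return np.mean([pls1_predict(m, X) for m in models], axis=0)

# Synthetic stand-in for calibration data (not the corn NIR data set).
rng = np.random.default_rng(1)
X = rng.normal(size=(260, 10))
true_coef = np.zeros(10)
true_coef[:3] = [1.0, -2.0, 0.5]
y = X @ true_coef + 0.1 * rng.normal(size=260)

models = cpls_fit(X[:200], y[:200])
pred = cpls_predict(models, X[200:])
rmse = float(np.sqrt(np.mean((pred - y[200:]) ** 2)))
print(f"kept {len(models)} models, consensus test RMSE = {rmse:.3f}")
```

Averaging is only one way to combine the kept models; a weighted combination based on validation error would fit the same description.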

【Abstract】 Modeling of analytical data is a common task in chemometrics. There are two types of problems in the modeling of analytical data, namely regression (or calibration) and pattern recognition. A single model is inherently susceptible to difficulties associated with data quality and sample number, which can greatly reduce its generalization performance, and consensus (ensemble) modeling has therefore received wide attention in recent years. In this dissertation, a consensus strategy was used in the modeling of NIR spectroscopy and microarray data, and the theory and application of consensus modeling were investigated. The work includes the following:
1. The basic theories and frequently used methods for the modeling of analytical data were reviewed, with emphasis on the basic theories, modeling methods, and applications of consensus modeling.
2. Based on random resampling, a partial least squares-based consensus regression method, cPLS, was proposed. In cPLS, rather than selecting one PLS model on the basis of the best fit, several PLS models satisfying a predefined criterion were selected and combined into one cPLS model. The effectiveness of cPLS was demonstrated by comparing its prediction results with those of regular PLS in the calibration of the NIR spectra of corn samples. The results suggested that combining multiple individual PLS models by cPLS could improve not only the accuracy of prediction but also the robustness of the model.
3. By combining local modeling with consensus modeling, a consensus dynamic local partial least squares method, CDL-PLS, was proposed. Unlike regular PLS and the many consensus methods reported in the literature that use bagging or boosting to generate constituent predictors, CDL-PLS generates constituent models with a dynamic local modeling technique: the samples used to develop the constituent predictors are not randomly selected from the original training data set but are chosen according to their Euclidean distances to the unknown sample being predicted. The effectiveness of CDL-PLS was demonstrated by comparing its prediction results with those of a general PLS model in the calibration of near-infrared (NIR) spectral data of tobacco lamina samples. It was found that the dynamic local modeling technique could increase the prediction accuracy and stability of a predictor, while the combination of multiple dynamic local PLS models could further improve its prediction accuracy and robustness.
4. A new classification method, CAMCUN (consensus analysis of multiple classifiers using non-repetitive variables), was developed. The central idea of CAMCUN is to combine multiple, heterogeneous classifiers, each derived from distinct features selected according to their discriminatory power. CAMCUN was applied to the analysis of microarray gene expression data. The analysis included classification of cancer based on gene expression profiles, assessment of the chance correlation and the prediction confidence of the classifiers, and identification of biomarkers. It was found that CAMCUN gave much better prediction accuracy, with higher prediction confidence and lower chance correlation, than any of the constituent classifiers.
5. By integrating disjoint principal component analysis (PCA) with a genetic algorithm (GA), a new feature selection method for pattern recognition was developed and applied to the identification of differentially expressed genes from microarray gene expression profiles. In this method, the discriminatory power of a combination of genes was obtained from disjoint PCA, and the GA was used to search for the best combination of genes. The significance of the differential expression of individual genes was assessed by a statistical method. It was found that the differentially expressed genes identified by this method showed stronger discriminatory power than those obtained from the t-test and SAM (significance analysis of microarray).
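
The idea behind CAMCUN in item 4 (several classifiers, each built on its own features chosen by discriminatory power, combined by consensus) can be illustrated with a small sketch. The abstract does not state how features are scored, which classifiers serve as members, or how votes are combined, so the following are assumptions made for illustration only: a t-like score for ranking, round-robin dealing of the top-ranked features so that no feature is shared between members, nearest-centroid member classifiers, majority voting, and synthetic two-class data in place of real expression profiles. All function names are hypothetical.

```python
import numpy as np

def rank_features(X, y):
    """Rank features by an absolute two-sample t-like score (assumed criterion)."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)
    return np.argsort(score)[::-1]       # best feature first

def fit_members(X, y, n_members=5, feats_per_member=4):
    """Build member classifiers on disjoint (non-repetitive) feature subsets."""
    order = rank_features(X, y)
    top = n_members * feats_per_member
    members = []
    for m in range(n_members):
        # Deal the top-ranked features round-robin: each member gets some
        # strong features and no feature is used by two members.
        feats = order[m:top:n_members]
        centroids = np.stack([X[y == c][:, feats].mean(axis=0) for c in (0, 1)])
        members.append((feats, centroids))
    return members

def predict_vote(members, X):
    """Each nearest-centroid member votes; the majority class wins."""
    votes = []
    for feats, centroids in members:
        diff = X[:, feats][:, None, :] - centroids[None, :, :]
        votes.append(np.linalg.norm(diff, axis=2).argmin(axis=1))
    # With an odd number of members there are no ties.
    return (np.mean(votes, axis=0) > 0.5).astype(int)

# Synthetic two-class data: 100 "genes", the first 20 differ between classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 100))
X[y == 1, :20] += 1.5
train = np.r_[0:30, 40:70]
test = np.r_[30:40, 70:80]

members = fit_members(X[train], y[train])
pred = predict_vote(members, X[test])
accuracy = float((pred == y[test]).mean())
used = np.concatenate([feats for feats, _ in members])
print(f"{len(members)} members, {len(set(used.tolist()))} distinct features, "
      f"consensus accuracy = {accuracy:.2f}")
```

Because the members share no features, their errors are less likely to coincide, which is the diversity argument made for CAMCUN; the sketch does not reproduce the chance correlation or prediction confidence assessments.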
