Research on Some New Methods of Statistical Learning Based on Chemical Data

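【Author】 Huang Xin (黄新)

【Advisor】 Xu Qingsong (许青松)

【Author information】 Central South University, Probability Theory and Mathematical Statistics, 2013, PhD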

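【Abstract (translated from the Chinese)】 Data are becoming increasingly complex, particularly in structure-activity relationship (SAR) studies and spectral data analysis, and how to use statistical learning methods to extract the most useful information from such data is one of the central problems in applied statistics today. Guided by a data-driven approach and set against the background of chemical data, this thesis examines the strengths and weaknesses of several classical statistical methods, including classification and regression trees (CART), support vector machines (SVM), and partial least squares (PLS). On that basis it constructs a novel tree kernel and proposes a series of new statistical learning methods. The thesis consists of seven chapters.

Chapter 1 briefly introduces the research background and motivation, then reviews in some detail the theories and methods commonly used in chemical data analysis and points out their respective strengths and weaknesses, which form the basis for the new methods developed here. It closes with the main content and contributions of the thesis.

Chapter 2 presents the construction of the tree kernel, which is proposed here for the first time and is one of the main contributions. Based on a close study of the CART principle, we point out for the first time that samples in the same terminal node share not only class similarity but possibly some other specific similarity as well. To obtain trees with diverse structures, we couple a Monte Carlo procedure with the classification tree algorithm and construct the tree kernel using a fuzzy pruning strategy and an ensemble strategy. Fuzzy pruning explores the information inside the nodes without completely destroying the tree structure. The ensemble strategy better captures the regular patterns in the data and makes the results more stable. This was our original motivation for constructing the tree kernel. During construction, a large number of tree models are built. To find the variable subsets most relevant to classification and the sample subsets that share a specific similarity in different variable subspaces, each classification tree performs a greedy, though not necessarily globally optimal, search in both the variable space and the sample space. The ensemble of trees can therefore reveal similarities between samples and also assess the importance of each variable.

The tree kernel consequently has the following advantages. First, it is supervised, because class information shapes the structure of the trees during kernel construction. Second, irrelevant variables contribute little to the tree ensemble and so have little influence on the kernel values, which allows important variables to be identified effectively. Third, because it builds on the classification tree algorithm, it can handle nonlinear problems. Within the framework of kernel methods, we then incorporate the tree kernel into SVM, PLS, and the k-nearest neighbor algorithm, giving three new statistical learning methods: tree kernel support vector machine (TKSVM), tree kernel partial least squares (TKPLS), and tree kernel k-nearest neighbor classification. Empirical results on three SAR datasets show that these advantages of the tree kernel effectively improve the traditional algorithms.

For high-dimensional spectral data, we propose a new modeling method, PLSSIS. The difficulty in analyzing high-dimensional spectral data (such as near-infrared spectra) is that the measurements are highly collinear and at the same time contain a large amount of redundant information. PLS is usually applied in this setting. However, a PLS model includes all of the original variables, redundant ones among them, which degrades its predictive performance. Using the PLS regression coefficients together with the sure independence screening (SIS) principle to select important variables step by step, we propose a new variable selection strategy, partial least squares regression based on sure independence screening (PLSSIS). PLSSIS is a forward iterative algorithm combining PLSR and SIS that handles high-dimensional collinear data quickly and effectively. Experiments on three spectral datasets show that, compared with standard PLS and moving window partial least squares regression (MWPLSR), PLSSIS selects fewer variables and offers better interpretability and predictive performance.

Finally, Chapter 7 summarizes the thesis and outlines directions for future research.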
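
The tree kernel idea described above can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not specify the fuzzy pruning rule or the Monte Carlo details, so the sketch assumes bootstrap resampling of the samples, a random subset of candidate variables at each node, a fixed maximum depth in place of fuzzy pruning, and a kernel value equal to the fraction of trees in which two samples fall into the same terminal node (the same construction as a random-forest proximity matrix). All function names and parameters are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow(X, y, rows, n_try, depth, rng):
    """Greedy CART-style tree; returns None for a terminal node."""
    labels = [y[i] for i in rows]
    if depth == 0 or len(set(labels)) == 1:
        return None
    best = None
    # Assumption: only n_try randomly chosen variables are tried per node.
    for j in rng.sample(range(len(X[0])), n_try):
        vals = sorted({X[i][j] for i in rows})
        for lo, hi in zip(vals, vals[1:]):
            t = (lo + hi) / 2.0  # midpoint threshold
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # no variable offers a split
        return None
    _, j, t, left, right = best
    return (j, t,
            grow(X, y, left, n_try, depth - 1, rng),
            grow(X, y, right, n_try, depth - 1, rng))


def leaf_id(tree, x, path=""):
    """String identifying the terminal node that sample x falls into."""
    if tree is None:
        return path
    j, t, left, right = tree
    if x[j] <= t:
        return leaf_id(left, x, path + "L")
    return leaf_id(right, x, path + "R")


def tree_kernel(X, y, n_trees=200, n_try=1, depth=3, seed=0):
    """K[a][b] = fraction of trees in which samples a and b share a leaf."""
    rng = random.Random(seed)
    n = len(X)
    counts = [[0] * n for _ in range(n)]
    for _ in range(n_trees):
        # Assumption: the Monte Carlo step is a bootstrap resample of rows.
        rows = [rng.randrange(n) for _ in range(n)]
        tree = grow(X, y, rows, n_try, depth, rng)
        leaves = [leaf_id(tree, x) for x in X]
        for a in range(n):
            for b in range(n):
                if leaves[a] == leaves[b]:
                    counts[a][b] += 1
    return [[c / n_trees for c in row] for row in counts]
```

Each tree contributes a block-diagonal 0/1 matrix (up to a reordering of the samples), so the averaged matrix is symmetric and positive semidefinite. Under these assumptions it could be handed to any learner that accepts a precomputed Gram matrix, which is the role the tree kernel plays in TKSVM, TKPLS, and the k-nearest neighbor variant.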

【Abstract】 Data are becoming increasingly complex, especially in structure-activity relationship studies and spectral data analysis, and how to mine the most useful information from such data with statistical learning methods is one of the central issues in current applied statistics research. Guided by a data-driven approach and set against the background of chemical data, we studied in depth the advantages and disadvantages of several classical statistical methods, such as classification and regression trees, support vector machines, and partial least squares, and proposed several new statistical learning methods. The thesis consists of seven chapters.

In Chapter 1, we briefly introduce the research background and motivation, then review some theories and methods of statistical learning used in chemical data analysis, which are the foundation of the new methods. We close the chapter with the main content and contributions of the thesis.

In Chapter 2, the tree kernel is proposed for the first time; it is one of the most important contributions. We discuss the classification and regression tree (CART) algorithm in detail and point out that the samples under the same terminal node may share some specific similarity, rather than being limited to class similarity. To obtain a diversity of tree structures, we coupled a Monte Carlo procedure with a classification tree algorithm and constructed a novel tree kernel using a fuzzy pruning strategy and an ensemble strategy. The fuzzy pruning strategy helps exploit the information in the inner nodes of the trees without totally destroying the tree structure. The ensemble strategy makes the results obtained with the tree kernel more stable and reliable than those of a single CART model, so that they do not arise from chance. This was our original motivation for building the tree kernel. In fact, CART carries out a greedy, though not necessarily globally optimal, search over samples and variables to find the variable subsets most relevant to classification and the sample subsets with a specific similarity under different variable subspaces.

The constructed tree kernel has several advantages. It is supervised, because the class information dictates the structure of the trees while the kernel is constructed. Because irrelevant variables contribute little to the tree ensemble, they have little influence on the proximity measure, so the tree kernel can readily discover the important variables. By virtue of the classification tree, the tree kernel can deal effectively with nonlinear problems. Then, under the framework of kernel methods, we coupled the tree kernel with support vector machines, partial least squares, and k-nearest neighbor, and presented three new statistical learning methods: tree kernel support vector machine (TKSVM), tree kernel partial least squares (TKPLS), and tree kernel k-nearest neighbor (TKk-NN). Three datasets related to different categorical bioactivities of compounds are used to test the performance of these methods. The results show that the advantages of the tree kernel effectively improve the traditional methods.

For high-dimensional spectral data, we proposed a novel modeling method, PLSSIS. A difficulty of high-dimensional data analysis lies in multicollinearity and a large amount of redundant information. PLS is usually employed in this case. However, a calibration model that includes all the variables contains much redundant information, which harms the prediction ability of the model. By employing PLS regression coefficients and the sure independence screening (SIS) principle, we developed a novel strategy for selecting variables stepwise, named PLS regression combined with sure independence screening (PLSSIS). PLSSIS is a forward iterative algorithm combining PLSR with SIS, and it deals quickly and efficiently with high-dimensional collinear data. For three spectral datasets, our study shows that PLSSIS gives better prediction than PLS modeling and moving window partial least squares regression (MWPLSR).

Finally, Chapter 7 summarizes the whole thesis and gives an outlook on future work.
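
The forward selection idea behind PLSSIS can be sketched as follows. The abstract only states that variables are chosen stepwise from PLS regression coefficients under the SIS principle, so the details here are assumptions rather than the thesis's procedure: at each step the not-yet-selected variables are ranked by the absolute value of their PLS coefficients against the current residual, the top `step` of them are added, and a PLS model on the selected set is refit to update the residual. The stopping rule (a fixed `max_vars`), the NIPALS-style PLS1 fit, and all names are hypothetical, and the variables are assumed to be on comparable scales.

```python
import numpy as np


def pls1_coef(X, y, n_comp):
    """Regression coefficients of a PLS1 fit on centred X and y."""
    n_comp = min(n_comp, X.shape[1])
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk                    # weight vector
        norm = np.linalg.norm(w)
        if norm < 1e-12:                 # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = (yk @ t) / tt                # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)


def plssis(X, y, n_comp=2, step=2, max_vars=6):
    """Forward screening by |PLS coefficient| (illustrative sketch)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    selected, resid, b_sel = [], yc.copy(), np.zeros(0)
    while len(selected) < max_vars:
        rest = [j for j in range(X.shape[1]) if j not in selected]
        if not rest:
            break
        # Screen the remaining variables against the current residual.
        b = pls1_coef(Xc[:, rest], resid, n_comp)
        k = min(step, max_vars - len(selected))
        top = np.argsort(-np.abs(b))[:k]
        selected += [rest[i] for i in top]
        # Refit on the selected variables and update the residual.
        b_sel = pls1_coef(Xc[:, selected], yc, n_comp)
        resid = yc - Xc[:, selected] @ b_sel
    return selected, b_sel
```

In practice the number of components and the number of retained variables would more plausibly be chosen by cross-validation than fixed in advance; the fixed values here only keep the sketch short.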

  • 【Online publication contributor】 Central South University
  • 【Online publication year and issue】 2014, Issue 02