Improved Support Vector Regression and Its Application in Plant Protection (支持向量回归机的改进及其在植物保护中的应用)

【Author】 Tan Siqiao (谭泗桥)

【Advisors】 Bai Lianyang (柏连阳); Yuan Zheming (袁哲明)

【Author information】 Hunan Agricultural University, Plant Pathology, 2008, PhD

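【Abstract (translated from the Chinese)】 Plant protection research involves many regression modeling problems. Most of them are nonlinear, which limits the usefulness of traditional linear methods such as multiple linear regression and stepwise linear regression. Nonlinear methods based on empirical risk minimization, such as artificial neural networks, approximate nonlinear functions well but struggle with the curse of dimensionality and local minima, and they overfit severely on small samples, so the risk of erroneous predictions is high. Statistical Learning Theory (SLT) is a newer theory that grew out of research on statistical estimation from small samples; its main contribution is the structural risk minimization principle and the support vector machine (SVM) built on it. SVM comprises support vector classification (SVC) and support vector regression (SVR). It handles nonlinearity, small samples, overfitting, and the curse of dimensionality comparatively well, offers a global optimum and strong generalization, and has been widely applied in many fields, but reports of its use in plant protection are few. This thesis addresses several shortcomings of SVR, develops a number of new algorithms, and applies them in plant protection to longitudinal data regression (exemplified by multidimensional time series analysis) and non-longitudinal data regression (exemplified by pesticide quantitative structure-activity relationship (QSAR) modeling and feed formulation optimization). The main contents and results are as follows.

(1) Several shortcomings of SVR were addressed. Kernel selection in SVR relies on experience and lacks theoretical guidance; this thesis develops a method that automatically selects the best of four commonly used kernel functions by the minimum mean squared error (MSE) criterion. In nonlinear systems, screening independent variables with linear methods such as stepwise linear regression is problematic; this thesis develops an SVR-based "multi-round last-place elimination" method that starts from an SVR model containing all input descriptors and, using leave-one-out (LOO) and the minimum-MSE criterion, nonlinearly removes one by one the variables that hurt prediction accuracy, leaving the retained variables. Another shortcoming of SVR is that it has no explicit expression and is therefore hard to interpret; this thesis develops an SVR-based "multi-round forced last-place elimination" method that ranks the retained variables by their influence on prediction accuracy, giving SVR partial interpretability. In connection with the optimization of complex multi-factor, multi-level formulations, the thesis gives F-test procedures for the significance of regression and of partial regression in SVR models, further improving interpretability. To assess how trustworthy SVR predictions are on small samples, the thesis develops a "double leave-one-out" method: starting from the optimal kernel function and its retained variables, the data are normalized, the optimal SVR parameters are searched again by LOO, and the sample is trained with those parameters before prediction. Double LOO approximates an independent test. On the basis of these improvements, a framework for applying SVR in regression analysis was established.

(2) SVR-based combined prediction methods were developed for pesticide QSAR modeling. Combined models predict more accurately than single models, and the thesis builds two. The first addresses heterogeneity in the sample set by combining SVR with the K-nearest-neighbor method: after kernel selection and descriptor screening, sub-models based on different nearest neighbors are combined for prediction by double LOO, so that optimization proceeds along both rows (samples) and columns (descriptors), improving accuracy. The second addresses the difficulty of small-sample modeling by combining a local kernel with strong learning ability and a global kernel with strong generalization ability: sub-models using the radial basis function kernel and the polynomial kernel form a combined sample, on which SVR-based kernel selection and descriptor screening are carried out before prediction by double LOO. This method is more accurate than linear combination. The two methods were applied to QSAR modeling of different pesticides, and in both cases the prediction accuracy exceeded results reported in the literature.

(3) SVR was used to optimize complex multi-factor, multi-level formulations. Optimizing a formulation and interpreting factor effects from a small number of experiments is of great value. Using published data on feed formulation optimization for the diamondback moth, the thesis first establishes a methodology for SVR-based formulation optimization and factor-effect analysis: starting from the original formulation sample set, kernel selection and variable screening with SVR are followed by double-LOO prediction. The prediction accuracy exceeds that of a multiple linear regression model, indicating that the nonlinear SVR is better suited to formulation optimization. After prediction over all factor-level combinations, a frequency-based search is carried out, and the most frequent level of each factor determines whether factor levels should be extrapolated in a further round of experiments, to safeguard the optimization result. Previously, an SVR model could only be evaluated against a reference model using MSE; starting from the definition of the F test, the thesis constructs a significance test for SVR regression. When a conventional quadratic polynomial regression equation is used to interpret factor effects from the sign and size of its coefficients, the first-order and second-order terms often contradict each other, while an ordinary SVR model offers no interpretation at all; starting from the definition of partial regression analysis, the thesis establishes an SVR-based method that uses the F test to interpret and evaluate the relative importance of each factor. SVR-based methods for analyzing single-factor effects and two-factor interaction effects are also given. The example above is a methodological study on published data; a real case is then used to validate the new methods. Combining SVR with uniform design to optimize a 12-factor, 5-level fermentation medium for jinggangmycin (井冈霉素) showed that, with only 20 treatments, OD560, which characterizes the jinggangmycin content, rose from 1.72 for the initial formulation (already optimized by the manufacturer) to 2.22, and the final formulation retained only 6 factors. The factor-effect analysis was reasonable and the optimization effect was highly significant.

(4) Multidimensional time series were analyzed with geostatistics and SVR. A multidimensional time series model must capture the influence of environmental factors and also reflect the dynamics of the sample set; phase-space reconstruction (that is, order determination) is a difficulty. The thesis combines geostatistics with SVR to build the GS-SVR model for multidimensional time series analysis: the semivariogram is used to analyze the structure of the dependent variable, and its range determines the order to which the dependent variable is expanded, which keeps the order expansion from falling into a local optimum. Because most of the effect of historical environmental factors on the dependent variable being predicted is already reflected in the historical values of the dependent variable, the historical environmental factors are expanded by one order only. After the order is fixed, SVR is used for kernel selection and nonlinear variable screening, principal component analysis reduces information redundancy and the dimension of the samples, and finally one-step independent prediction is carried out with SVR. Two multidimensional time series examples, the incidence of wheat scab and the damage level of the second-generation corn borer, show that GS-SVR predicts clearly more accurately than the reference models.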
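As a rough illustration of the variable-screening idea in item (1), the sketch below removes one variable per round for as long as doing so lowers the leave-one-out MSE. It is a sketch under stated assumptions rather than the thesis's algorithm: the names `loo_mse`, `backward_eliminate`, and `nearest_neighbour` are hypothetical, the stopping rule is one plausible reading of "minimum MSE", and a 1-nearest-neighbour predictor stands in for SVR so that the example needs only the standard library.

```python
from typing import Callable, List, Sequence, Tuple

# fit_predict(train_X, train_y, x) returns a prediction for one held-out row.
FitPredict = Callable[[List[List[float]], List[float], List[float]], float]


def loo_mse(X: Sequence[Sequence[float]], y: List[float],
            cols: List[int], fit_predict: FitPredict) -> float:
    """Leave-one-out MSE using only the columns listed in `cols`."""
    sub = [[row[c] for c in cols] for row in X]
    total = 0.0
    for i in range(len(sub)):
        train_X = sub[:i] + sub[i + 1:]
        train_y = y[:i] + y[i + 1:]
        total += (fit_predict(train_X, train_y, sub[i]) - y[i]) ** 2
    return total / len(sub)


def backward_eliminate(X: Sequence[Sequence[float]], y: List[float],
                       fit_predict: FitPredict) -> Tuple[List[int], float]:
    """Drop one variable per round while that lowers the LOO MSE."""
    kept = list(range(len(X[0])))
    best = loo_mse(X, y, kept, fit_predict)
    while len(kept) > 1:
        # Score every candidate model that has exactly one variable removed.
        trials = [(loo_mse(X, y, [c for c in kept if c != d], fit_predict), d)
                  for d in kept]
        mse, drop = min(trials)
        if mse >= best:  # no removal helps any more: stop
            break
        kept.remove(drop)
        best = mse
    return kept, best


def nearest_neighbour(train_X, train_y, x):
    """Stand-in regressor (1-NN). The thesis uses SVR at this point."""
    dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[dists.index(min(dists))]


# Toy data (made up): column 0 drives y, column 1 is unrelated noise.
X_demo = [[0, 50], [1, 0], [2, 30], [3, 10], [4, 60], [5, 20], [6, 70], [7, 40]]
y_demo = [2.0 * row[0] for row in X_demo]
kept_demo, mse_demo = backward_eliminate(X_demo, y_demo, nearest_neighbour)
print(kept_demo, mse_demo)
```

Passing the predictor in as a callable keeps the elimination loop independent of the model, so an SVR fit-and-predict function could be substituted for `nearest_neighbour` without changing the loop.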

【Abstract】 Regression modeling problems are common in plant protection research, and most of them are nonlinear. Traditional linear methods such as Multiple Linear Regression (MLR) and Stepwise Linear Regression (SLR) are therefore of limited use. Nonlinear methods based on Empirical Risk Minimization (ERM), such as Artificial Neural Networks (ANN), are good at nonlinear approximation but cope poorly with high dimensionality and local minima, and they tend to overfit severely on small samples, with a high risk of erroneous predictions. Statistical Learning Theory (SLT) was developed through research on statistical estimation from small samples; its major contribution is the Structural Risk Minimization (SRM) principle, on which the Support Vector Machine (SVM) learning method is based. SVM provides efficient and powerful algorithms that can handle high-dimensional, nonlinear, small-sample problems. It covers classification (Support Vector Classification, SVC) and regression (Support Vector Regression, SVR), and offers global optimization and strong generalization ability. SVM has been used in many fields, but reports of its use in plant protection are few. This thesis studies the application of SVR in plant protection. Several weaknesses of SVR, namely kernel selection without rules, dimension reduction, poor interpretability, and low confidence in the model, are addressed with newly proposed algorithms. Based on these improvements, two kinds of SVR modeling in plant protection are analyzed systematically: longitudinal data regression (exemplified by multidimensional time series analysis) and non-longitudinal data regression (exemplified by pesticide quantitative structure-activity relationship (QSAR) modeling and feed formulation optimization). The main conclusions are as follows.

(1) Defects of SVR were addressed. Kernel selection in SVR lacks a theoretical basis and depends only on experience. The author developed a method that automatically selects the optimal kernel from four common kernels by the minimum-MSE principle. It is unreasonable to reduce dimension by selecting nonlinear descriptors with a linear method such as stepwise linear regression. The thesis proposes Multi-round Optimization, which starts from an SVR model containing all input descriptors and nonlinearly eliminates, step by step, the descriptors that are unfavorable for prediction precision, according to leave-one-out and the minimum-MSE principle; those that remain are the retained descriptors. Lacking an explicit expression, SVR results are hard to explain. Building on Multi-round Optimization, the author proposed Multi-round Compulsory Optimization, which gives the order of the descriptors' influence on prediction precision, so the model gains some explanatory capacity. To strengthen the reliability of SVR models on small samples, the author developed Secondary Leave-one-out: after the optimal kernel and retained descriptors are fixed and the data normalized, the optimal SVR parameters are searched by leave-one-out, the samples are trained with them, and prediction is made. Validation showed that Secondary Leave-one-out is similar to independent testing. A basic technical framework for regression analysis based on the improved SVR was constructed.

(2) Combined prediction methods based on SVR were developed for QSAR modeling. Because a combined model is more precise than a single model, two combined models based on SVR were constructed for pesticides. First, because most data are heterogeneous, kernel optimization and descriptor selection were carried out with SVR, and combined prediction was then made by Secondary Leave-one-out and KNN; optimization is applied to both samples and descriptors, so precision is high. Second, because modeling is relatively hard for small samples, another combined model was constructed for QSAR studies on small sample sets, again with kernel optimization and descriptor selection before prediction. This model is assembled from a local kernel (RBF kernel) and a global kernel (polynomial kernel), and its precision is clearly higher than that of the linear combination method. The two methods were applied to QSAR modeling of different pesticides, and the results are better than those reported in the literature.

(3) Complex multi-factor, multi-level formulations were optimized with SVR. It is valuable to optimize a formulation and analyze factor effects from a small number of experiments. Taking feed formulation optimization for the diamondback moth (DBM) as an example, the methodology of SVR-based formulation optimization was studied. Starting from the initial compositions, Secondary Leave-one-out was carried out after kernel optimization and descriptor screening; the precision is higher than that of the linear regression model, which shows that SVR is well suited to formulation optimization. Frequency statistics over the predictions for all factor-level combinations determine whether factor levels should be extrapolated. Previously, model evaluation required comparison with a reference model by MSE; this thesis constructs an F-test-based significance test for SVR regression. Interpreting factor effects from the first- and second-order coefficients of a quadratic polynomial often gives contradictory results, so the thesis puts forward a new method that explains and evaluates factor effects by an F test based on the partial regression sum of squares. Methods for evaluating single-factor effects and two-factor interactions are proposed as well. The effectiveness of the approach was evaluated in a real experiment: uniform design and SVR were combined to optimize a 12-factor, 5-level culture medium for Streptomyces hygroscopicus var. jinggangensis Yen. The OD560 of the resulting formulation is 2.22, clearly higher than the initial OD560 (1.72), with only 6 factors retained. The model explains factor effects reasonably and is effective for optimizing culture media.

(4) Multidimensional time series were analyzed with the GS-SVR model, which is built on geostatistics and SVR. Such a model must capture both the effects of environmental factors and the dynamics of the series, and the length of the dynamics (the order) is hard to determine. The author analyzed the structure of the data with the geostatistical semivariogram and used it to define the expansion order of the time series, to avoid a locally optimal expansion. The effects of historical environmental factors on the predicted variable are already embedded in the historical values of that variable, so the historical environmental factors are expanded by one order only. Kernel optimization and nonlinear descriptor selection follow the order expansion, and principal component analysis (PCA) then reduces data redundancy. Finally, independent prediction is carried out with SVR. Prediction models for the diseased panicle rate of wheat scab and the damage degree of the second-generation corn borer were constructed, and the results showed that the SVR-based methods offer high prediction precision and stability.
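The data-preparation step described in item (4) can be sketched as follows. This is an illustrative reading, not the thesis's code: `semivariogram` computes the standard empirical semivariance of a series at integer lags, and `build_lagged` expands the dependent variable to a given order while adding only the most recent environmental record. Both function names and the toy numbers are hypothetical, and the rule for reading the range off the semivariogram is not given in the abstract, so `order` is simply passed in.

```python
from typing import List, Sequence, Tuple


def semivariogram(y: Sequence[float], max_lag: int) -> List[float]:
    """Empirical semivariance gamma(h) = sum((y[t+h] - y[t])^2) / (2 * N(h)), h = 1..max_lag."""
    gammas = []
    for h in range(1, max_lag + 1):
        diffs = [(y[t + h] - y[t]) ** 2 for t in range(len(y) - h)]
        gammas.append(sum(diffs) / (2 * len(diffs)))
    return gammas


def build_lagged(y: Sequence[float], env: Sequence[Sequence[float]],
                 order: int) -> Tuple[List[List[float]], List[float]]:
    """Each row holds y[t-order .. t-1] plus env[t-1]; the target is y[t]."""
    X, target = [], []
    for t in range(order, len(y)):
        # Dependent variable: `order` lags. Environmental factors: one lag only.
        X.append(list(y[t - order:t]) + list(env[t - 1]))
        target.append(y[t])
    return X, target


# Toy series (made up): six time steps and one environmental factor.
y_series = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
env_series = [[10.0], [11.0], [12.0], [13.0], [14.0], [15.0]]

gammas = semivariogram(y_series, max_lag=2)
X_lag, y_lag = build_lagged(y_series, env_series, order=2)
print(gammas)
print(X_lag[0], y_lag[0])
```

The rows of `X_lag` would then go through kernel selection, variable screening, and PCA before the final SVR fit, as the abstract describes.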
