Studies on Model Optimization and Model Transfer Methods of Near Infrared Spectroscopy (近红外光谱分析模型优化和模型转移算法研究)

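【Author】 Zheng Kaiyi (郑开逸)

【Supervisor】 Du Yiping (杜一平)

【Author information】 East China University of Science and Technology, Analytical Chemistry, 2013, PhD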

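【Abstract (translated from the Chinese original)】 Near-infrared (NIR) spectra have low signal intensity and severely overlapping bands, so chemometric methods are needed to build mathematical models that extract chemically meaningful information. To improve predictive performance, NIR models need to be optimized; to make them more widely applicable, model transfer must be achieved. Optimization of NIR models includes spectral pretreatment and variable selection.

For spectral pretreatment, this thesis studies a method based on fractional-order Savitzky-Golay differentiation. Fractional-order Savitzky-Golay differentiation generalizes integer-order Savitzky-Golay differentiation, which is the special case in which the order is an integer. As in the integer-order case, a window with an odd number of points is constructed and the coefficients of the polynomial to be differentiated are fitted first. Then, using the Riemann-Liouville definition of the fractional derivative together with the fitted polynomial coefficients, the fractional derivative is obtained as a linear combination of the original spectrum. The method needs no cumbersome formulas: it is enough to construct a banded diagonal matrix and multiply the spectral matrix by it on the right. The method was validated on diesel, wheat, and corn data. With the window size and polynomial degree fixed, fractional-order derivatives captured more detailed information than integer-order derivatives, and the resulting root mean square error of cross-validation (RMSECV) and root mean square error of prediction (RMSEP) were both lower than those of integer-order differentiation. When the predicted property was not a component content, for example sample viscosity, density, or hardness, the results were markedly better than those of integer-order differentiation.

For variable selection, this thesis studies stability competitive adaptive reweighted sampling (SCARS). The method constructs a series of variable sets. For the variables in each set, it computes the stability of each variable by a Monte Carlo method and uses it as the index of variable importance. Variables are then removed by enforced elimination based on an exponential function and by adaptive reweighted sampling (ARS). The same procedure (recomputing stability, enforced elimination, ARS) is repeated on the remaining variables. Finally, each set is evaluated by cross-validation, and the set with the lowest RMSECV is chosen as the optimal set. The method was validated on tobacco, corn, and wheat data. The variable sets selected by SCARS gave lower RMSECV and RMSEP than moving window PLS (MWPLS), Monte Carlo uninformative variable elimination (MCUVE), and competitive adaptive reweighted sampling (CARS).

Overfitting caused by variable selection was also examined. Variable selection with SCARS, CARS, and MCUVE was run on data generated from random numbers, which carry no classification information. Even for these data, the methods picked out seemingly "good" variable combinations that greatly reduced the calibration-set error, and the more variables the original data contained, the "better" the classification result. The same phenomenon appeared for randomly generated regression data. This anomalous result shows that variable selection can itself cause overfitting: it finds "good" variable combinations in uninformative data, which biases the selection toward the calibration set. To study the cause of this phenomenon and how to guard against it, tobacco nicotine data were used as the informative component, and uninformative data were added at different ratios to construct simulated data. The simulated data were split into a calibration set and an independent test set. SCARS was run on the calibration set; for each selected variable set, both the RMSECV of the calibration set and the RMSEP of the independent test set (predicted by a model built on the calibration set) were computed, and their changes were followed as the variable set shrank. When noise served as the uninformative data and its standard deviation was at most 0.02 times the mean standard deviation of the informative spectra, or when permuted spectra served as the uninformative component and their intensity was at most 0.1 times the intensity of the informative spectra, the trends of RMSECV and RMSEP were almost identical. As the uninformative component grew, the trends became less similar. When the noise standard deviation exceeded 0.02 times the mean standard deviation of the informative spectra, or when the intensity of the permuted spectra exceeded 0.1 times the mean standard deviation of the informative spectra, the trends of RMSECV and RMSEP differed significantly. Comparing the RMSECV and RMSEP trend plots during variable selection can therefore be used to check whether a variable selection algorithm is valid: when the two differ little, the selection can be considered effective; when they differ greatly, it is ineffective.

For model transfer, this thesis studies a method based on the informative components of the spectra. Partial least squares (PLS) for a response vector is used to extract, from the master spectra and the slave spectra separately, the information relevant to modeling the predicted value. Spectral-correction-based transfer methods (canonical correlation analysis (CCA), direct standardization (DS), and PLS for a response matrix (PLS2)) then convert the informative components of the slave spectra into those of the master spectra. Finally, the transferred informative components are fed into the master model for prediction. The method was validated on corn data, a three-component system dataset, and a laboratory-prepared dataset of dimethyl fumarate in milk. For transfer methods based on spectral correction, transferring the informative components gave better results than transferring the full spectra.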
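The overfitting effect reported in the abstract, where variable selection finds apparently good variables in purely random data, can be illustrated with a small simulation. The sketch below is not the thesis procedure: it replaces SCARS, CARS, and MCUVE with the simplest possible selector (pick the single random variable most correlated with a random response), and the function names, sample sizes, and seed are all illustrative assumptions.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_random_variable(n_cal, n_test, n_vars, seed=0):
    """Select the random variable most correlated with a random response.

    Hypothetical stand-in for a real variable selection method. Returns
    (|r| on the calibration samples, |r| on the test samples) for the one
    variable chosen using the calibration samples alone.
    """
    rng = random.Random(seed)
    n = n_cal + n_test
    # The response is pure noise, so no variable is truly informative.
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best_cal, best_test = 0.0, 0.0
    for _ in range(n_vars):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r_cal = abs(pearson(x[:n_cal], y[:n_cal]))
        if r_cal > best_cal:
            best_cal = r_cal
            best_test = abs(pearson(x[n_cal:], y[n_cal:]))
    return best_cal, best_test


if __name__ == "__main__":
    for p in (10, 100, 1000):
        cal, test = best_random_variable(n_cal=20, n_test=20, n_vars=p)
        print(p, round(cal, 3), round(test, 3))
```

Because the seed is fixed, the candidate variables for a smaller `n_vars` are a subset of those for a larger one, so the calibration correlation of the selected variable can only grow as more random variables are offered. The correlation on the held-out samples has no reason to follow, which is the same contrast between RMSECV and RMSEP paths that the abstract proposes as a diagnostic.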

【Abstract】 Near-infrared (NIR) spectroscopy suffers from low absorption intensity and overlapping bands, so chemometric methods are used to build models that extract chemical information. To improve predictive ability, the models should be optimized by spectral pretreatment and variable selection; to make the models more general, calibration transfer should be performed.

For spectral pretreatment, this thesis applies fractional-order Savitzky-Golay differentiation to NIR spectra. Fractional-order Savitzky-Golay differentiation generalizes ordinary (integer-order) Savitzky-Golay differentiation, which is the special case in which the order is an integer. As in ordinary Savitzky-Golay differentiation, the polynomial coefficients are first obtained by fitting the data within a window of the spectrum. Then, using Riemann-Liouville fractional calculus and the fitted coefficients, the derivative is obtained as a linear combination of the data in the window. No complex formulas are needed: the differentiated spectra are obtained by multiplying the raw spectra on the right by a banded diagonal matrix. Three datasets (diesel, wheat, and corn) were used to test the method.
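The construction described above (fit a polynomial in a window, differentiate it term by term with the Riemann-Liouville rule, and express the result as fixed weights on the window points) can be sketched as follows. This is a minimal illustration, not the thesis implementation. It assumes window positions t = 1, ..., w, a lower terminal of 0 for the Riemann-Liouville derivative, and evaluation at the window center; the abstract does not state these conventions, and all function names are hypothetical. The rule used for each power is D^α t^k = Γ(k+1)/Γ(k+1-α) · t^(k-α).

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rl_power_factor(k, alpha):
    """Gamma(k+1) / Gamma(k+1-alpha), taken as zero at the poles of Gamma."""
    arg = k + 1 - alpha
    if arg <= 0 and float(arg).is_integer():
        return 0.0  # 1/Gamma vanishes here, as for integer-order derivatives
    return math.gamma(k + 1) / math.gamma(arg)


def fractional_sg_weights(window, degree, alpha):
    """Weights w such that sum(w[i] * y[i]) is the order-alpha derivative
    of the least-squares polynomial, evaluated at the window center.

    Assumed convention: positions t = 1..window, lower terminal 0.
    """
    t = [float(i) for i in range(1, window + 1)]
    tc = t[window // 2]
    # Normal matrix A^T A of the polynomial fit, with A[i][k] = t_i ** k.
    ata = [[sum(ti ** (j + k) for ti in t) for k in range(degree + 1)]
           for j in range(degree + 1)]
    # Fractional derivative of each monomial at the center point.
    c = [rl_power_factor(k, alpha) * tc ** (k - alpha)
         for k in range(degree + 1)]
    z = solve_linear(ata, c)
    return [sum(z[k] * ti ** k for k in range(degree + 1)) for ti in t]


def apply_filter(spectrum, weights):
    """Slide the window over one spectrum; only interior points are returned."""
    w = len(weights)
    return [sum(wi * spectrum[i + j] for j, wi in enumerate(weights))
            for i in range(len(spectrum) - w + 1)]
```

With these conventions, `fractional_sg_weights(5, 2, 1)` reproduces the classical five-point quadratic first-derivative weights (-0.2, -0.1, 0, 0.1, 0.2), which is the integer-order special case mentioned in the abstract. Sliding the weights along a spectrum in `apply_filter` is equivalent to multiplying the spectrum by a banded matrix whose columns hold the shifted weights.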
The results showed that, compared with ordinary Savitzky-Golay differentiation, the proposed method captures more spectral detail and yields lower root mean square error of cross-validation (RMSECV) and root mean square error of prediction (RMSEP), especially for properties other than chemical composition, such as viscosity, density, and hardness.

A new variable selection method, stability competitive adaptive reweighted sampling (SCARS), was proposed. In SCARS, variables are selected by a stability index, defined as the absolute value of a regression coefficient divided by its standard deviation. The algorithm consists of a number of loops. In each loop, the stability of each variable is computed; then, based on stability, enforced wavelength selection and adaptive reweighted sampling (ARS) are used to select important variables. The selected variables are kept as a subset and passed to the next loop. After all loops have run, a number of variable subsets are obtained, and the RMSECV of a partial least squares (PLS) model built on each subset is computed. The subset with the lowest RMSECV is taken as the optimal subset. The algorithm was evaluated on three NIR datasets: tobacco, corn, and wheat. SCARS gave the lowest RMSECV and RMSEP compared with moving window PLS (MWPLS), Monte Carlo uninformative variable elimination (MCUVE), and competitive adaptive reweighted sampling (CARS).

Overfitting caused by variable selection was also explored. SCARS, CARS, and MCUVE were applied to a dataset generated from random variables, which carries no classification information. Surprisingly, the variable selection methods could still select "good" variable combinations that separated the "two classes" with "low" prediction errors, and the prediction errors decreased as the number of raw variables increased. Likewise, when random variables carrying no regression information were generated, SCARS still selected "good" variable combinations with low prediction errors. In essence, the ability of a variable selection method to find "good" variable combinations among uninformative variables is overfitting. To study the causes and diagnosis of this problem, simulated data were generated by adding uninformative data to the raw spectra of the tobacco dataset at different ratios. The simulated data were divided into a calibration set and an independent test set, and variable selection was run to compare the path of RMSECV for the calibration set with the corresponding path of RMSEP for the independent test set. When the ratio of uninformative data to spectra is small (at most 0.02 with noise as the uninformative data, and at most 0.1 with randomly permuted spectra as the uninformative data), the paths of RMSECV are similar to those of RMSEP. When the ratio exceeds 0.02 for noise or 0.1 for randomly permuted spectra, the paths of RMSECV differ from those of RMSEP. Comparing the two paths can therefore be used to evaluate variable selection: high similarity means the selection is effective, while low similarity means it is ineffective.

For calibration transfer, a new method was proposed that corrects the informative components instead of the full spectra. The method uses partial least squares (PLS) for a response vector to extract the informative components related to the predicted property from the raw spectra, and then corrects these components with a spectral transfer method such as canonical correlation analysis (CCA), direct standardization (DS), or partial least squares for a response matrix (PLS2). The algorithm was tested on three sets of spectra: a corn dataset, a tri-component solvent dataset, and a dataset of dimethyl fumarate in milk. Correcting the informative components reduced errors significantly compared with correcting the full spectra.
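The stability index that SCARS uses, stated in the abstract as the absolute value of a regression coefficient divided by its standard deviation, can be sketched with Monte Carlo resampling. The following is a rough illustration under several assumptions that the abstract does not specify: the regression model is a one-component PLS1 fit, each Monte Carlo run draws 80% of the samples without replacement, and the numerator is the absolute mean coefficient over the runs. Function and parameter names are hypothetical.

```python
import random
import statistics


def pls1_one_component(X, y):
    """Regression coefficients of a one-component PLS1 model (centered data)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: normalized covariance between each variable and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores, then the regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return [wj * q for wj in w]


def stability(X, y, n_runs=100, fraction=0.8, seed=0):
    """Monte Carlo stability: |mean coefficient| / std of the coefficient."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = int(fraction * n)
    coefs = []
    for _ in range(n_runs):
        idx = rng.sample(range(n), k)  # random subset of the samples
        coefs.append(pls1_one_component([X[i] for i in idx],
                                        [y[i] for i in idx]))
    result = []
    for j in range(p):
        column = [c[j] for c in coefs]
        result.append(abs(statistics.mean(column)) / statistics.stdev(column))
    return result
```

A variable whose coefficient keeps the same sign and size across resampled models gets a high stability value, while a variable whose coefficient fluctuates around zero gets a low one. Because the index is a ratio, it does not depend on the scale of the variable, which is one reason a ratio of this kind is convenient for ranking wavelengths.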
