多元校正新算法研究和二维数据分析方法在色谱分离评价中的应用

New Algorithms for Multivariate Calibration and Applications of Two-way Data Analysis to Chromatographic Separation Evaluation

【Author】 Xu Lu (徐路)

【Supervisor】 Yu Ruqin (俞汝勤)

【Author information】 Hunan University, Analytical Chemistry, 2008, Ph.D.

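【Abstract, translated from the Chinese】 This thesis investigates several difficult problems in multivariate calibration, proposes a number of new chemometric algorithms, and applies them to standard calibration data sets. It also studies the application of two-way chemometric data analysis methods to the evaluation of chromatographic separation quality. The work covers the following aspects:

1. The representativeness of training samples in multivariate calibration and the problem of optimal sample weighting are examined. Because the spectral space of the samples is high-dimensional and complex, and because sample selection involves uncertainty, it remains difficult to estimate accurately how well the training samples represent the whole sample space. Traditional calibration models mostly choose representative samples by empirical methods, which under unfavorable conditions can degrade the model's predictions for new samples. To address this, and because representativeness is hard to assess by examining samples one at a time, the idea of globally optimized sample weighting is combined with partial least squares (PLS) to give a new algorithm, optimized sample-weighted PLS. The algorithm applies non-negative weights to the original training samples and considers both model complexity and predictive ability during calibration; the optimal sample weights are found by particle swarm optimization (PSO). To make sample-weighted PLS easier to compute and optimize, it is further proved that a sample-weighted calibration model can be obtained by multiplying each sample's spectrum and component concentration by the same non-negative constant. Results on real benchmark data sets show that optimized sample-weighted PLS does improve predictive performance when the original calibration samples are poorly representative.

2. Based on PSO, a variable weighting method is proposed that is more flexible than traditional variable selection. In a traditional variable-selection model, the variables that enter the model are in effect given a weight of 1 and the discarded variables a weight of 0. If variables are allowed to take continuous non-negative weights, traditional variable selection becomes a special case of variable weighting. Moreover, because the weights are chosen to optimize the fit to the calibration set and the prediction of the validation set simultaneously, continuous non-negative weighting can be viewed as an optimized rescaling of the spectral variables, which makes it more flexible than variable selection. Studies on real calibration data sets show that variable-weighted PLS (VW-PLS) not only performs the role of variable selection but also retains more variables in the model, preserving the multi-channel advantage of multivariate calibration.

3. A new machine learning algorithm, stacked regression, is improved and applied to multivariate calibration, achieving fast automatic selection of wavelength intervals together with model combination. Monte Carlo cross validation (MCCV) replaces the traditional cross validation used in stacked regression, and the improved algorithm is then used to combine PLS models built on individual wavelength sub-intervals. Under the constraint that the combination coefficients are non-negative, the resulting model has the minimum root mean squared error of MCCV (RMSEMCCV), so it can be expected to generalize well and to resist overfitting. Stacked regression determines the combination coefficients by non-negative least squares (NNLS), which sets the coefficients of some sub-interval models to 0 and thereby selects wavelength sub-intervals automatically. In addition, the MCCV of a linearly combined model can be obtained by combining the MCCV results of the sub-models, and cross validation of a single sub-interval model is cheap, so the method requires much less computation than comparable interval selection methods. Studies on standard calibration data sets confirm its practicality.

4. A new concept for preprocessing near infrared (NIR) spectra in multivariate calibration, ensemble preprocessing, is proposed. Because NIR spectra are often affected by background, baseline drift and noise, appropriate preprocessing of the raw measurements has in many cases become a necessary step. However, given the complexity of the spectra and the lack of prior information, finding the best preprocessing method usually takes repeated trials and requires an experienced operator. A single preprocessing method may also improve the data in some respects while degrading them in others, with a risk of information loss, and calibration models based on a single preprocessing method may be unstable when predicting new samples. To solve these problems, ensemble preprocessing uses MCCV stacked regression to combine a series of calibration models based on different preprocessing methods, which selects and optimally weights the preprocessing methods automatically. Results on real calibration data sets show that, compared with models based on a single preprocessing method, ensemble preprocessing maintains or improves accuracy and improves model stability.

5. Moving window partial least squares regression (MWPLSR) is applied to calibration transfer to build a global calibration model with high stability and low complexity. When an existing calibration model is applied to new samples whose spectra contain contributions absent from the training samples, the model must be transferred to avoid bias and serious error. MWPLSR, a recent wavelength interval selection method, is introduced into the global calibration model. It selects the spectral sub-intervals related to the chemical components and reduces the complexity of the global model. Studies on standard calibration data sets show that the MWPLSR-based global model has these advantages and achieves good calibration transfer.

6. The problems that traditional chromatographic separation criteria, based on chromatograms from single-channel detectors, may encounter when estimating separation quality are discussed. Many of these problems arise because a one-way chromatogram with severely overlapped peaks lacks information such as the number of components, the degree of overlap and peak purity. The applications of two-way chemometric data analysis methods to the evaluation of chromatographic separation efficiency are then reviewed, and some important issues are discussed on the basis of the literature and the author's research experience.

7. A new chromatographic separation criterion based on the rank graph, peak-purity weighted resolution (PPWR), is proposed. Compared with traditional criteria based on single-channel detectors, PPWR has the advantage of using the number of chemical components, the degree of overlap, the elution time and the peak purity together, information that is hard to obtain from a one-way chromatographic signal when peaks overlap severely. Studies on a simulated chromatographic system and a real one show that the PPWR value reasonably reflects the degree of chromatographic overlap and that the criterion can be used to assess the separation of severely overlapped chromatograms. Finally, points that require attention when using PPWR are discussed.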
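The row-scaling result stated in item 1 can be illustrated numerically. The sketch below is not the thesis's algorithm: it uses ordinary least squares in place of PLS, synthetic data, and the assumption that the "same non-negative constant" applied to a sample's spectrum and concentration is the square root of that sample's weight. Under those assumptions, solving the weighted normal equations and fitting an unweighted model to the scaled rows give the same coefficients.

```python
import numpy as np

def weighted_ls(X, y, w):
    """Solve the weighted normal equations (X^T W X) b = X^T W y."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

def scaled_ls(X, y, w):
    """Unweighted least squares after scaling each sample (row) by sqrt(w_i).

    Assumption: the per-sample constant is sqrt(w_i), applied to both the
    spectrum (row of X) and the concentration (entry of y).
    """
    s = np.sqrt(w)
    b, *_ = np.linalg.lstsq(X * s[:, None], y * s, rcond=None)
    return b

# Hypothetical data: 8 samples, 3 spectral channels.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=8)

# Non-negative sample weights; a weight of 0 removes a sample entirely.
w = np.array([1.0, 0.0, 2.5, 0.3, 1.0, 4.0, 0.7, 1.2])

b_weighted = weighted_ls(X, y, w)
b_scaled = scaled_ls(X, y, w)
same = np.allclose(b_weighted, b_scaled)
print("coefficients agree:", same)
```

The practical consequence is that any existing unweighted regression routine can be reused for a weighted fit by rescaling its inputs, which is convenient when an outer search such as PSO has to evaluate many candidate weight vectors.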

【Abstract】 The research work in this thesis focuses on new chemometric algorithms for multivariate calibration and on the applications of two-way data analysis methods to chromatographic separation evaluation.

The representativeness of training samples for multivariate calibration is discussed and the concept of weighted sampling is introduced to multivariate calibration. Owing to the high dimensionality and complexity of the spectral data space and the uncertainty involved in the sampling process, the representativeness of training samples in the whole sample space is difficult to evaluate, and the selection of representative training samples depends largely on empirical methods. If the training samples fail to represent the sample space, the predictions for new samples can be degraded. To solve this problem, a new algorithm for multivariate calibration is developed by combining optimized sampling and partial least squares (PLS), in which the original training samples are non-negatively weighted and the complexity and predictivity of the model are considered simultaneously. Moreover, it is proved that weighted sampling can be achieved by multiplying both the spectrum and the concentration value of a sample by the same non-negative constant, which makes the computation of sample-weighted models much easier. Two real data sets are investigated, and the results demonstrate that sample-weighted PLS can improve the predictivity of a model when the representativeness of the original calibration samples is poor.

Based on the particle swarm optimization (PSO) algorithm, variable weighting is proposed as a more flexible alternative to variable selection. In traditional variable selection methods, the variables included in the model are essentially weighted with ones and those excluded from the model are weighted with zeros. If continuous non-negative weights are allowed, traditional variable selection is just a special case of variable weighting. Since the variable weights are determined to simultaneously optimize the training of the calibration set and the prediction of the validation set, variable weighting can be seen as an optimized rescaling of the variables and is therefore more flexible than traditional variable selection. Results obtained from real data sets indicate that variable-weighted PLS (VW-PLS) can not only play the same role as variable selection but can also maintain the multi-channel advantage by including more variables in the model.

A new machine learning method, stacked regression, is improved and then introduced to multivariate calibration to achieve fast automatic spectral interval selection. Instead of traditional cross validation (CV), Monte Carlo cross validation (MCCV) is adopted in the improved stacked regression, which is then used to combine the regression models built on different spectral intervals. Under non-negativity constraints on the combination coefficients, the resulting combined model has the minimum root mean squared error of MCCV (RMSEMCCV), so the model is expected to generalize well and to carry less risk of overfitting. Stacked regression obtains the combination coefficients by non-negative least squares (NNLS), and spectral interval selection is achieved because some coefficients are set to zero. Moreover, because the MCCV of a linearly combined model can be obtained by linearly combining the MCCV of the separate interval models, each of which is much simpler to compute, MCCV stacked regression is computationally economical. The practicability of the proposed method is demonstrated by its application to two real data sets.

A new concept of data preprocessing for multivariate calibration, ensemble preprocessing, is proposed. Because raw near infrared (NIR) spectra are often influenced by factors such as background, baseline shifts and noise, the raw data need to be preprocessed properly in multivariate calibration. However, owing to the complexity of NIR data and the lack of prior information, finding the optimal preprocessing is still a matter of trial and error and requires experienced practitioners. Another disadvantage of traditional preprocessing is that any single method carries a risk of information loss and might degrade the data in some respects while improving them in others. Moreover, models based on a single preprocessing method are sometimes unstable when predicting new samples. To solve these problems and to select and optimize preprocessing methods automatically, an ensemble preprocessing method is developed by combining calibration models based on different preprocessing methods through MCCV stacked regression. Results obtained from real data sets demonstrate that, compared with traditional preprocessing using a single method, ensemble preprocessing can lead to a more stable calibration model while maintaining or improving the precision of the model.

Moving window partial least squares regression (MWPLSR) is introduced to calibration transfer to develop a stable and low-complexity global calibration model. When an existing calibration model is applied to new samples containing spectral variations that were not calibrated, the model should be adjusted to avoid bias and serious error. MWPLSR can select concentration-correlated spectral intervals and reduce the complexity of the global calibration model. Investigation of two benchmark data sets confirms that the global calibration model based on MWPLSR has these advantages and can achieve stable and reliable calibration transfer.

The disadvantages of traditional chromatographic separation criteria based on chromatograms recorded by single-channel detectors are discussed. Many of these problems are caused by the lack of information on the number of components, peak purity and degree of overlap when peaks are seriously overlapped. The applications of two-way chemometric methods to assessing chromatographic separation quality are then reviewed, and some important problems are discussed on the basis of the literature and our research experience.

A new chromatographic separation criterion based on the rank graph, peak-purity weighted resolution (PPWR), is proposed. Compared with traditional separation criteria based on one-way chromatograms, the advantage of PPWR is that it jointly takes into account the number of components, peak purity and degree of overlap, information that is difficult to obtain from one-way chromatograms with serious overlaps. PPWR is applied to a simulated data set and a real chromatographic system; the results indicate that PPWR is a reasonable separation criterion for seriously overlapped peaks and can reflect the degree of overlap. Finally, some important problems that might be encountered when using PPWR are discussed.
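The MCCV stacked regression step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the data are synthetic, ordinary least squares stands in for the PLS sub-models, the interval layout and split settings are hypothetical, and a brute-force enumeration of supports (`nnls_small`, an illustrative helper) stands in for a proper NNLS routine, which is only practical for a handful of sub-models.

```python
import itertools
import numpy as np

def ols_fit_predict(X_tr, y_tr, X_te):
    """Least-squares sub-model with intercept (stand-in for PLS on one interval)."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([np.ones(len(X_te)), X_te]) @ coef

def mccv_predictions(X, y, intervals, n_splits=50, test_frac=0.3, seed=0):
    """Collect left-out predictions of every interval sub-model over random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_frac * n))
    P_rows, y_rows = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        te, tr = perm[:n_test], perm[n_test:]
        preds = [ols_fit_predict(X[tr][:, cols], y[tr], X[te][:, cols])
                 for cols in intervals]
        P_rows.append(np.column_stack(preds))
        y_rows.append(y[te])
    return np.vstack(P_rows), np.concatenate(y_rows)

def nnls_small(P, y):
    """Non-negative least squares by enumerating candidate supports (small k only)."""
    k = P.shape[1]
    best_c, best_sse = np.zeros(k), float(y @ y)  # start from the all-zero solution
    for r in range(1, k + 1):
        for S in itertools.combinations(range(k), r):
            idx = list(S)
            c_S, *_ = np.linalg.lstsq(P[:, idx], y, rcond=None)
            if np.all(c_S >= 0):
                c = np.zeros(k)
                c[idx] = c_S
                sse = float(np.sum((y - P @ c) ** 2))
                if sse < best_sse:
                    best_c, best_sse = c, sse
    return best_c

# Hypothetical data: 9 channels in 3 intervals; the third interval carries no signal.
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 9))
y = X[:, 0] - 0.5 * X[:, 1] + 0.8 * X[:, 4] + 0.1 * rng.normal(size=n)
intervals = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

P, y_cv = mccv_predictions(X, y, intervals)
coef = nnls_small(P, y_cv)

# The MCCV predictions of the combined model are just P @ coef, so no
# sub-model has to be refitted to evaluate a candidate combination.
rmse_sub = np.sqrt(np.mean((P - y_cv[:, None]) ** 2, axis=0))
rmse_stack = np.sqrt(np.mean((P @ coef - y_cv) ** 2))
print("combination coefficients:", coef)
print("sub-model RMSEMCCV:", rmse_sub, "stacked RMSEMCCV:", rmse_stack)
```

Because every single sub-model corresponds to a feasible coefficient vector, the stacked RMSEMCCV cannot exceed that of the best individual interval; an interval whose coefficient comes out as zero is effectively deselected.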

  • 【Online publication contributor】 Hunan University
  • 【Online publication year and issue】 2009, Issue 08