
分析化学多维数据解析的化学计量学新算法

New Chemometric Algorithms for the Analysis of Multi-way Data Arrays in Analytical Chemistry

【Author】 Wang Zhiguo (王志国)

【Supervisor】 Yu Ruqin (俞汝勤)

【Author information】 Hunan University, Analytical Chemistry, 2005, PhD

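【Summary】 With the proliferation of hyphenated analytical instruments, analytical chemists can readily acquire large volumes of analytical data. These data carry rich information about the chemical system, such as the number of components, the content of each component, and characteristic spectra. Extracting useful chemical information from such large data sets is beyond conventional data-processing techniques, and chemometric algorithms can help solve many of these problems. Chemometrics is a branch of chemistry that emerged in the 1970s; it applies methods from mathematics, statistics and computer science to design or select optimal measurement schemes and to extract as much chemical and related information as possible by analysing chemical measurement data. The development of chemometric theory and methods has enriched the fundamental theory of modern analytical chemistry. Within chemometrics, the analysis of multi-way data in analytical chemistry is one of the most active research areas. At present, research on the resolution and calibration of multi-component systems focuses mainly on two-way and three-way data. Advances in the theory and methods of multi-way data analysis have brought a qualitative improvement in the ability to analyse complex chemical systems and have made possible the direct analysis of “grey” and “black” systems that traditional analytical chemistry struggles to handle. Because chemical data have intrinsic characteristics, exploiting them to develop chemometric algorithms suited to complex systems has become an important trend in multi-way data analysis. After careful consideration of the current directions and research hot spots in chemometrics, the author selected several important problems in two-way and three-way chemical data analysis for study. The work mainly covers the following aspects: 1. Basic theory and algorithms for two-way chemical data analysis (Chapter 1 to Chapter 3): When one chromatographic peak lies entirely within the elution window of another, the inner peak is called the inner chromatogram and its component the inner component; correspondingly, the outer peak is called the outer chromatogram and its component the outer component. Compared with broad-band absorption spectra such as ultraviolet spectra, mass spectra are discrete. For a system with embedded chromatographic peaks, if the inner and outer components give different signals on the mass-spectral detection channels, there may be selective channels in the mass-spectral direction that yield pure chromatograms. Using this prior chemical knowledge, we propose the inner chromatogram projection method for resolving embedded chromatographic peaks. The method places no special requirements on peak shape or on the relative position of the inner and outer chromatograms, and is applicable to all kinds of embedded chromatographic peaks with mass-spectrometric detection. In two-way data, pure spectra are also called pure variables. After normalization, the two-way data points all lie within a hypersphere whose vertices are the pure variables; this is the geometric model of two-way data. Making full use of this model, we propose the vertex vector sequential projection method for resolving two-way chromatographic data. Because selective regions are common in the chromatographic direction, that is, pure variables located at the vertices exist in the geometric space, the vertex vector sequential projection method can determine the pure variables one by one and then optimize them through an iterative algorithm.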
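The geometric idea behind pure-variable selection (normalized data points bounded by vertices that correspond to pure variables, found one at a time by projection) can be illustrated with a small sketch. This is not the thesis's vertex vector sequential projection algorithm, whose details are not given here; it is a generic successive-projection vertex search, and it omits the iterative refinement step. The simulated profiles, the spectra `s1` and `s2`, the choice of sum normalization, and the function names are all assumptions made for illustration.

```python
import numpy as np


def simulate_two_way():
    """Noise-free two-component data matrix (40 time points x 5 channels)."""
    c1 = np.zeros(40)
    c2 = np.zeros(40)
    c1[5:25] = np.hanning(22)[1:-1]   # component 1 elutes at rows 5..24
    c2[15:35] = np.hanning(22)[1:-1]  # component 2 elutes at rows 15..34
    s1 = np.array([1.0, 0.2, 0.0, 0.5, 0.1])  # hypothetical pure spectra
    s2 = np.array([0.1, 0.9, 0.6, 0.0, 0.3])
    return np.outer(c1, s1) + np.outer(c2, s2)


def select_pure_rows(D, n_components):
    """Return indices of rows of D that act as vertices (pure variables)."""
    totals = D.sum(axis=1)
    keep = np.flatnonzero(totals > 1e-12 * totals.max())  # skip empty rows
    # Sum-normalise (an assumed choice): every mixed row then becomes a
    # convex combination of the normalised pure rows.
    R = D[keep] / totals[keep, None]
    chosen = []
    for _ in range(n_components):
        # The largest-norm point of a convex set of points is a vertex.
        idx = int(np.argmax(np.linalg.norm(R, axis=1)))
        chosen.append(int(keep[idx]))
        u = R[idx] / np.linalg.norm(R[idx])
        R = R - np.outer(R @ u, u)  # project out the direction just found
    return chosen


if __name__ == "__main__":
    print(select_pure_rows(simulate_two_way(), 2))
```

In this simulation rows 5 to 14 contain only component 1 and rows 25 to 34 only component 2, so one selected index should fall in each selective region. With noisy data the selected rows would only approximate pure variables, which is presumably why a refinement step is needed.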

【Abstract】 With the emergence of many hyphenated instruments, analysts can easily obtain very large volumes of analytical data in the form of matrices that consist of hundreds or even thousands of data points. These data matrices or data arrays contain abundant chemical information, including the number of chemical components and the pure spectra, chromatograms and contents of these components. However, it is hard to extract this information from such large data matrices with conventional data processing techniques alone. Analysts have to resort to chemometrics, a sub-branch of chemistry that emerged in the 1970s. As an interface between chemistry and mathematics, statistics and computer science, chemometrics designs and selects optimal schemes for chemical measurements and extracts as much chemical information as possible from the data. As chemometrics has evolved, its methodologies have enriched the fundamental theory of modern analytical chemistry. Among chemometric methodologies, multi-way data analysis in analytical chemistry is one of the most active areas and has practical significance. Two-way and three-way data analysis has attracted wide interest in the resolution and calibration of multi-component systems. These multi-way data analysis approaches provide a promising tool for the direct analysis of the so-called “grey” and “black” analytical systems. Since chemical data reflect the characteristics of chemical systems, incorporating a priori chemical information into chemometric algorithms has become an important trend in multi-way data analysis. The present thesis primarily involves the following aspects of multi-way data analysis in analytical chemistry: 1. Two-way data analysis (Chapter 1 to Chapter 3): A chromatographic peak located entirely inside another peak in the time direction is called an embedded or inner peak, in distinction to the embedding peak, which is called the outer peak. 
The chemical components corresponding to the inner and outer peaks are called the inner and outer components, respectively. The ultraviolet-visible and near-infrared spectra of chemical compounds are band spectra, whereas mass spectra are discrete. If the inner and outer components give different signals on different measuring channels, there may exist selective channels that represent pure chromatograms. Based on this prior chemical information, the inner chromatogram projection (ICP) method is proposed for the resolution of GC-MS data with embedded chromatographic peaks. The performance of ICP is not affected by the shapes of the chromatograms or by the relative position of the two components. It can be used to resolve any pattern of embedded chromatograms when a mass spectrometer is the detector. In two-way data analysis, pure spectra are also referred to as pure variables. After any form of normalization, the two-way data points are located on a hyper-“spherical” surface whose vertices are constituted by the pure variables. A resolution procedure named vertex vector sequential projection (VVSP) is developed for determining pure variables in two-way data by making full use of this geometry. Since selective regions commonly exist in the time direction, VVSP ascertains the pure variables one by one and then refines them through an iterative optimization procedure. The proposed method is shown to be a competent tool for the resolution of two-way data. Additionally, VVSP does not require the ascertainment of feature regions, and its principle and implementation are straightforward. For the determination of elution windows and patterns, the pure spectrum evolving projection (PSEP) method is proposed. 
PSEP tries to find pure spectra or pure projected spectra and uses the evolving projection method to find the elution windows of the overlapping chromatograms component by component. PSEP can locate the starting and ending elution points of all components; more importantly, it gives a direct indication of elution patterns. PSEP has proved to be a useful tool for discovering the elution windows in two-way data. 2. Three-way data analysis (Chapter 4 to Chapter 8): The most important prerequisite for three-way data analysis is that the data arrays should strictly follow the trilinear model. In order to improve the accuracy and reliability of the decomposition of three-way data contaminated by nonlinear data or outliers, iterative reweighted parallel factor analysis (IRPARAFAC) is proposed. The basic assumption of the proposed method is that the residuals corresponding to data entries contaminated with large deviations are larger than those of the others. A cosine function is used to decide the weight of each entry. The IRPARAFAC algorithm iteratively updates the weights as the unmodeled residuals improve. During the iterative procedure, the data entries with large deviations are gradually discovered and assigned small or even zero weights. Hence their influence on the chemical loading parameters is gradually mitigated. The IRPARAFAC algorithm provides a promising tool for the qualitative and quantitative analysis of trilinear data arrays containing nonlinear data or outliers. Chromatographic shifts can hardly be avoided, because the stability of both the operator and the state of the instrument cannot always be guaranteed from run to run. If the shifts are severe, trilinearity is no longer satisfied. To solve this problem, the VVSP method is applied to the analysis of three-way chromatographic data. 
The three-way data array is unfolded along a certain direction into one matrix, and a multi-bilinear model is obtained. The VVSP method is then used to select the pure variables and iteratively improve the fit of the data to the multi-bilinear model. The multi-bilinear model guarantees that the chromatograms in each sample can be resolved separately, which circumvents the model deficiency caused by retention time shifts. The results for both simulated and real chemical data sets demonstrate that the proposed method is more efficient than PARAFAC when the chromatographic shifts are very severe. If the chromatographic shifts are slight or have been corrected, the three-way data can be regarded as decomposable in the trilinear domain. Thus trilinear evolving factor analysis (TEFA) is proposed, making use of the trilinearity of three-way data and the evolving nature of chromatography. Compared with a two-way matrix, a three-way data array provides a matrix at each point along the time direction. This higher dimensionality makes it possible to conduct singular value decomposition (SVD) at each elution time point, whereas a number of neighboring profiles are needed to perform SVD in two-way resolution. The rank map of three-way data can therefore be obtained by direct rank analysis of the matrix at each time point. Provided trilinearity is guaranteed, accurate elution information can be obtained from the rank map. Additionally, this method does not require the selection of a window size, which affects the selectivity and sensitivity in the two-way case. From the rank maps, selective regions as well as spectral and concentration profiles can be determined, and the coupled vector resolution (COVER) method can then be used to resolve the chromatograms. The COVER method needs pure variables in two dimensions, or at least one calibration sample, to achieve the resolution. 
This requirement can be met by the information acquired from the trilinear rank map. TEFA-COVER realizes the idea of resolving profiles component by component by subtracting the contributions of components that have already been resolved. As a result, it achieves the direct resolution of a “black” system. The number of components is also called the chemical rank of the three-way data. The determination of the chemical rank is crucial to the decomposition of three-way data arrays. Thus the chemical subspace projection (CSP) method is proposed for the determination of the chemical rank of three-way data arrays. The proposed method projects the unfolded three-way data onto chemically meaningful subspaces and determines the chemical rank by checking the lengths of the projected vectors. The proposed method is simple to use and can give an accurate estimate of the number of components in an ordinary three-way data array.
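The rank-map idea described in the abstract (a three-way array supplies a full matrix at every elution time point, so a rank estimate can be made at each point without a moving window) can be sketched as follows. This is only an illustration of the rank-analysis step on noise-free simulated data, not the TEFA implementation from the thesis. The array layout (time, channel, sample), the profiles in `A`, `B` and `C`, the relative threshold `rel_tol`, and the function names are assumptions.

```python
import numpy as np


def simulate_three_way():
    """Noise-free trilinear array with shape (time, channel, sample)."""
    A = np.zeros((40, 2))                  # elution profiles of 2 components
    A[5:25, 0] = np.hanning(22)[1:-1]      # component 1 elutes at rows 5..24
    A[15:35, 1] = np.hanning(22)[1:-1]     # component 2 elutes at rows 15..34
    B = np.array([[1.0, 0.1], [0.2, 0.9], [0.0, 0.6],
                  [0.5, 0.0], [0.1, 0.3]])  # hypothetical spectra (5 channels)
    C = np.array([[1.0, 0.5], [0.4, 1.2], [0.8, 0.9]])  # 3 samples
    # X[i, j, k] = sum_f A[i, f] * B[j, f] * C[k, f]
    return np.einsum("if,jf,kf->ijk", A, B, C)


def rank_map(X, rel_tol=1e-8):
    """Numerical rank of the (channel x sample) matrix at each time point."""
    # np.linalg.svd treats the leading axis as a stack of matrices.
    s = np.linalg.svd(X, compute_uv=False)  # shape (time, min(channel, sample))
    # Count singular values above a threshold relative to the largest one.
    return (s > rel_tol * s.max()).sum(axis=1)


if __name__ == "__main__":
    print(rank_map(simulate_three_way()))
```

In this simulation the rank is 0 where nothing elutes, 1 in the two selective regions and 2 where the peaks overlap. With measured data the threshold would have to be chosen relative to the noise level rather than fixed at a tiny relative tolerance.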

  • 【Online publication contributor】 Hunan University
  • 【Online publication issue】 2006, Issue 06