
化学数据挖掘新算法和定量构性关系基础研究

New Methodology in Chemical Data Mining and Foundational Research on QSPR

【Author】 Du Yiping (杜一平)

【Supervisor】 Liang Yizeng (梁逸曾)

【Author information】 Hunan University, Analytical Chemistry, 2002, PhD

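【Summary】 Chemical data mining is attracting growing attention from chemists. To mine chromatographic retention index data effectively for differences in the retention behavior of different compounds, nearly 50 000 retention index records were collected to build a retention index database. Problems encountered in building and using the database are also discussed, including error detection and correction in the data, temperature correction of retention indices, and estimation of experimental error. This thesis applies projection pursuit to mine the data involved in studying the relationship between topological indices and retention indices, and constructs a projection pursuit algorithm. Projection pursuit on alkanes, alkenes and cycloalkanes shows that compounds of different structure can be separated into classes according to the number of carbon atoms in the molecule, the number of branches, the number of double bonds, the position of the double bonds, whether the double bonds are conjugated, the number of rings, and branching on the rings. Using this classification information, separate topological index-retention index and topological index-boiling point models are built for the different classes of compounds. For alkanes, the standard errors of the models approach or reach the level of experimental error, and the models have good predictive ability. In addition, when the compounds of one homologous series are used to construct the projection direction, a classification by homologous series is obtained; on this basis a class distance variable is proposed, with which very good structure-property relationship models can be built. Using orthogonalization among topological indices and taking the property into account, a similarity evaluation index and a difference evaluation index for topological indices are proposed to examine quantitatively the correlation between topological indices and the contribution of each topological index to the regression. The computed results show that these indices describe the relationships between variables reasonably well and can guide variable selection in QSPR studies. The thesis also proposes the concept of a block variable: several structural descriptors of one kind with similar definitions are combined to form a block variable. By partitioning a set of topological indices into blocks, orthogonalizing them, and using canonical correlation analysis to reduce each orthogonalized block variable to one dimension, a set of new variables is obtained that retains most of the information in the original variables while greatly reducing their number. This approach was found to improve the fitting and predictive ability of the structure-property models considerably. Chromatographic analysis of a complex sample is often a grey analytical system, in which some components are known and others are unknown. The thesis proposes a model and algorithm for calculating the dead time and the retention times of n-alkanes in a grey analytical system, and identifies unknown components using the large body of retention index data available in the literature. Applied to chromatographic analyses of two petroleum products, the algorithm gives dead times very close to the experimental results, and the calculated n-alkane retention times and retention indices of unknown components also agree well with the measured values.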
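The record does not spell out how the topological indices are orthogonalized. As an illustration only, the sketch below assumes a sequential Gram-Schmidt style scheme, in which each index is mean-centered and then replaced by its residual after projection onto the indices that precede it. The function name `orthogonalize_descriptors` and the numeric values are hypothetical and are not taken from the thesis.

```python
def orthogonalize_descriptors(columns):
    """Sequentially orthogonalize descriptor columns (assumed Gram-Schmidt scheme).

    columns: list of descriptor columns, each a list of values over the
    same set of compounds. Returns mean-centered, mutually orthogonal columns
    in the same order; the first column is only mean-centered.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    ortho = []
    for col in columns:
        mean = sum(col) / len(col)
        v = [x - mean for x in col]              # mean-center the descriptor
        for q in ortho:
            qq = dot(q, q)
            if qq > 1e-12:                       # skip (near-)zero columns
                coef = dot(q, v) / qq            # projection coefficient onto q
                v = [a - coef * b for a, b in zip(v, q)]
        ortho.append(v)                          # residual = new orthogonal descriptor
    return ortho


# Made-up descriptor values for five compounds (illustration only).
indices = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 4.0, 9.0, 16.0, 25.0],
    [2.0, 3.0, 5.0, 7.0, 11.0],
]
ortho = orthogonalize_descriptors(indices)
```

Because the result depends on the order of the columns, the part of a later index that remains after orthogonalization can be read as the information it adds beyond the earlier ones, which is the kind of question the similarity and difference evaluation indices are described as addressing.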

【Abstract】 This work focuses on data mining of chromatographic retention index data. A retention index database containing about 50 000 retention index records is first established. Projection pursuit is then applied to the data to uncover valuable information about the relationship between retention indices and structural descriptors. A novel projection pursuit algorithm is developed in this work. Samples of alkanes, alkenes and cycloalkanes are investigated. With the help of the new algorithm, interesting classifications based on specific structural features are revealed for these hydrocarbons, such as the number of carbon atoms in the molecule, the number of branches, the number and position of double bonds, whether the double bonds are conjugated, and the number of rings. Different models relating topological indices to retention indices are established for the different classes of samples obtained from the projection results, and the regression is significantly improved. This shows that several distinct linear models exist even for alkanes. Furthermore, an interesting projection result is obtained when the compounds of a homologous series are used to calculate the projection direction: all homologous series are separated from one another at regular distances. Based on this information, a new variable called the class distance variable is proposed to describe the difference between the classes of homologs. With this variable a much better model is obtained, whose estimation and prediction errors are all very small, close to the level of measurement error. Two indices, called the similarity evaluation index and the difference evaluation index, are proposed in this work. They can be used to investigate quantitatively the correlation between topological indices (TIs) and to estimate each TI's contribution to the regression model in QSPR. Applying these two indices to a data set of alkanes and alkenes shows that they describe the relationships between TIs reasonably well and are potentially useful in variable selection. A block descriptor, which contains a series of individual TIs with similar definitions, is also proposed. After individual topological indices are combined into a few blocks, a set of new one-dimensional variables is obtained by canonical correlation analysis without losing major information. With the new variables, models with few variables are established to describe the retention indices of alkanes; they show improved performance, with high correlation coefficients and small residuals. In the chromatographic analysis of complex multicomponent samples, grey analytical systems are often encountered, in which some components are identified and others are unknown. A model and algorithm for calculating the dead time and the retention times of n-alkanes in a grey analytical system are developed. Using the calculated dead time and n-alkane retention times, the retention indices of unknown components can be calculated easily. Results obtained by this method for two samples of petroleum products show that the calculated dead time, n-alkane retention times and retention indices of unknown components agree with the experimental values, with small errors.
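To make the link between dead time, n-alkane retention times and retention indices concrete, the sketch below uses the standard textbook relations for isothermal gas chromatography. It is not the grey-system algorithm developed in the thesis, which this record does not describe. It assumes that the logarithm of the adjusted retention time of n-alkanes is linear in carbon number, which gives the classical three-consecutive-alkane estimate of dead time, and it uses the Kovats definition of the retention index. The function names and the numeric values are hypothetical.

```python
import math


def dead_time_from_three_alkanes(t1, t2, t3):
    """Estimate dead time from three consecutive n-alkanes.

    Assumes log(adjusted retention time) is linear in carbon number, so that
    (t2 - tM)^2 = (t1 - tM) * (t3 - tM). Solving for tM gives the formula below.
    """
    denom = t1 + t3 - 2.0 * t2
    if denom == 0:
        raise ValueError("retention times are equally spaced; dead time is undefined")
    return (t1 * t3 - t2 * t2) / denom


def kovats_index(t_x, t_n, t_n1, n, t_dead):
    """Isothermal Kovats retention index of a peak eluting between
    the n-alkanes with n and n + 1 carbon atoms."""
    adj_x, adj_n, adj_n1 = (t - t_dead for t in (t_x, t_n, t_n1))
    if min(adj_x, adj_n, adj_n1) <= 0:
        raise ValueError("adjusted retention times must be positive")
    frac = (math.log(adj_x) - math.log(adj_n)) / (math.log(adj_n1) - math.log(adj_n))
    return 100.0 * (n + frac)


# Made-up retention times (minutes) for C6, C7, C8 and one unknown peak.
t_c6, t_c7, t_c8 = 3.0, 5.0, 9.0
t_unknown = 6.5

t_m = dead_time_from_three_alkanes(t_c6, t_c7, t_c8)
ri = kovats_index(t_unknown, t_c7, t_c8, 7, t_m)
```

In a grey analytical system the n-alkane peaks may not all be identified in advance, which is why the thesis treats the dead time and the n-alkane retention times as quantities to be calculated. The sketch covers only the final step, once those quantities are available.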

  • 【Online publication contributor】 Hunan University
  • 【Online publication year and issue】 2003, Issue 02