
支持向量机(SVM)和径向基神经网络(RBFNN)方法在化学、环境化学和药物化学中的应用研究

Application of Support Vector Machines (SVM) and Radial Basis Function Neural Networks (RBFNN) in Chemistry, Environmental Chemistry and Medicinal Chemistry

【Author】 Luan Feng (栾锋)

【Supervisors】 Liu Mancang (刘满仓); Zhang Haixia (张海霞)

【Author information】 Lanzhou University, Analytical Chemistry, 2006, PhD

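【Chinese abstract】 Quantitative structure-property/activity relationship (QSPR/QSAR) studies are among the major research topics in computational chemistry and chemoinformatics. They apply statistical and theoretical computational methods to study the quantitative relationships between the structures of organic compounds and their physicochemical properties and biological activities. The subjects of QSPR/QSAR study include the physicochemical properties, biological activities and toxicities of compounds and the pharmacokinetic parameters of drugs, and the field spans chemistry, chemical engineering, environmental chemistry, medicinal chemistry and many other disciplines. Building accurate quantitative mathematical models has always been one of the goals of QSPR/QSAR research, and the modeling method is a key factor in determining model quality, so developing new methods has always been an important task. Building on our group's research over more than 10 years on artificial neural networks (ANN), including BP networks and RBFNN, this dissertation applies the support vector machine (SVM) method to chemistry, environmental chemistry and medicinal chemistry, predicts properties of more than 1100 chemical substances, the toxicity of environmental toxicants and drug-related properties, and builds accurate QSPR/QSAR models. Chapter 1 outlines the basic principles, workflow and current status of QSPR/QSAR, with emphasis on modeling methods. After pointing out the shortcomings of current neural network modeling methods, it introduces in detail a new machine learning algorithm, the support vector machine, and summarizes and looks ahead to its applications in QSPR/QSAR. In Chapter 2, SVM and RBFNN are applied to chemistry in the following studies: (1) Multiple linear regression (MLR) and SVM were used to build QSPR models predicting the van der Waals constants of 364 organic compounds. MLR was used both to build the linear regression model and to select the input descriptors for SVM. The mean square errors (MSE) of the SVM models for the training set, cross-validation set, test set and whole data set were, for constant a, 5.96, 8.00, 6.67 and 6.65, and for constant b, 9.56 × 10⁻⁵, 3.18 × 10⁻⁴, 4.22 × 10⁻⁴ and 2.33 × 10⁻⁴. (2) The heuristic method (HM) and SVM were used to build linear and nonlinear QSRR models relating the gas chromatographic retention times of 149 volatile organic compounds to 5 molecular descriptors. The nonlinear SVM model outperformed the linear HM model, with test-set MSE of 1.094 and 1.644, respectively, and the predicted values agreed closely with the experimental values. (3) HM and RBFNN were used to build quantitative models predicting the permeability coefficients of 63 small organic molecules through low-density polyethylene. The resulting models are as reliable as previously reported models. This…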

【Abstract】 Quantitative structure-property/activity relationship (QSPR/QSAR) studies are important research topics in computational chemistry and chemoinformatics. They have been widely used to predict physicochemical properties and biological activities of organic compounds using different statistical methods and various kinds of molecular descriptors.

Building a rapid, simple and valid model is one of the central aims of QSPR/QSAR study. Since the modeling method is one of the major factors determining model quality, it is necessary to search for novel types of learning machine. On the basis of our group's research on artificial neural networks (ANN) over the past 10 years, this dissertation introduces the support vector machine (SVM) to chemistry, environmental chemistry and medicinal chemistry and uses it to predict important properties of organic compounds, environmental pollutants and drugs. We show the capability of radial basis function neural networks (RBFNN) and SVM in QSPR and QSAR analysis, and their potential to solve problems in biology, chemistry and environmental science, through several applications in classification and correlation analysis.

Chapter 1 gives a brief description of the principle, research process and current status of QSPR/QSAR, with emphasis on model-building methods. It also points out the shortcomings of present modeling methods such as ANN, then introduces a new machine learning method, the support vector machine, in detail. Finally, it reviews the applications of SVM in the QSPR/QSAR field and their prospects.

In Chapter 2, we applied SVM and RBFNN in chemistry, as follows:

(1) Multiple linear regression (MLR) and SVM were used to develop QSPR models to predict the van der Waals constants of a diverse set of 364 compounds. MLR was used both to select the molecular descriptors and to construct the linear model.
The SVM models gave mean square errors (MSE) for constant a of 5.96 for the training set, 8.00 for the validation set, 6.67 for the test set and 6.65 for the whole data set. For constant b, the values were 9.56 × 10⁻⁵ for the training set, 3.18 × 10⁻⁴ for the validation set, 4.22 × 10⁻⁴ for the test set and 2.33 × 10⁻⁴ for the whole data set.

(2) The heuristic method (HM) and SVM were used to develop linear and nonlinear QSRR models relating the retention time (RT) of 149 volatile organic compounds (VOCs) to five molecular descriptors. The mean squared errors (MSE) of the RT predictions for the test set were 1.644 for HM and 1.094 for SVM, showing that the SVM model performed better than the HM model. The predicted values agree well with the experimental values.

(3) A QSPR study using HM and RBFNN was performed on the permeability coefficients of 63 compounds through low-density polyethylene at 21.1 °C. The models obtained were comparable in performance to those reported by others, which suggests that this approach is a suitable alternative in polymer science.

In Chapter 3, SVM and RBFNN were applied to environmental chemistry.

(1) SVM, as a novel type of learning machine, was used to develop a classification model for the carcinogenicity of 148 N-nitroso compounds (NOCs). Seven descriptors, calculated solely from the molecular structures of the compounds and selected by forward stepwise linear discriminant analysis (LDA), were used as inputs to the SVM model. The SVM accuracy was 97.4% for the training set and 86.6% for the test set. The overall SVM accuracy was 95.2%, higher than that of LDA (89.8%). The results suggest that steric and electronic factors are likely the two major factors in carcinogenicity.
The model provides a useful and convenient way to classify the carcinogenicity of N-nitroso compounds.

(2) QSAR models for the binding of 93 polychlorinated dibenzofurans (PCDFs), dibenzodioxins (PCDDs) and biphenyls (PCBs) to the aryl hydrocarbon receptor (AhR) were developed using HM and SVM. Since various members of the three classes of compounds have been shown to produce qualitatively similar toxicities, the different classes were combined for each bioactivity in one QSAR study. A subset of five molecular descriptors selected by HM in CODESSA was used as inputs to the SVM. The results of the nonlinear SVM model were compared with those of the linear heuristic method, and the SVM predictions were better. For the test set, the SVM model gave a correlation coefficient (R) of 0.928 and a root-mean-square error (RMS) of 0.324, against 0.845 and 0.667 for the HM model. The work demonstrates that a single QSAR equation can be developed to predict the binding affinity of PCDFs, PCDDs and PCBs.

(3) Classification and regression models were developed to predict the sensory irritation potency (log RD50) of 142 volatile organic chemicals (VOCs). The best classification results were obtained with SVM: the accuracy for the training, test and overall data sets was 96.5%, 85.7% and 94.4%, respectively. Nonlinear regression models were built with RBFNN and SVM. With RBFNN, the root-mean-square errors (RMS) of prediction for the training, test and overall data sets were 0.4755, 0.6322 and 0.5009 for the reactive group and 0.2430, 0.4798 and 0.3064 for the nonreactive group. The corresponding SVM results were 0.4415, 0.7430 and 0.5140 for the reactive group and 0.3920, 0.4520 and 0.4050 for the nonreactive group.
This work proposes an effective method for screening and assessing poisonous chemicals.

(4) The rat blood:air partition coefficient (log Kblood) of 100 volatile organic compounds (VOCs) was predicted by QSPR models. Simple molecular descriptors calculated from the molecular structures alone were used to represent the compounds. HM was used to pre-select descriptors from the whole descriptor set and to build the linear model. The HM model gave a squared correlation coefficient (R²) of 0.8832. These QSPR models provide a rapid, simple and valid way to predict the log Kblood values of VOCs.

In Chapter 4, we applied SVM and RBFNN to medicinal chemistry in two studies:

(1) QSPR studies were carried out to predict the pKa values of a set of 74 neutral and basic drugs using linear and nonlinear methods based on HM and RBFNN, respectively. The linear model had a correlation coefficient (R) of 0.884 and an RMS error of 0.482 for the training set, and R = 0.693 and RMS = 0.987 for the test set; the RMS error of prediction for the overall data set was 0.619. The RBFNN model gave better results: R = 0.886 and RMS = 0.458 for the training set, and R = 0.737 and RMS = 0.613 for the test set; the RMS error of prediction for the overall data set was 0.493. The model is useful for predicting pKa during drug discovery when experimental data are unavailable.

(2) A QSPR study was performed to predict the standard Gibbs energies (ΔG°) of transfer of 54 peptide anions from aqueous solution to nitrobenzene based on HM, RBFNN and SVM. A comparison of the three methods shows that the nonlinear models performed better than the linear model, and that of the nonlinear models SVM was better than RBFNN. For the SVM model, the RMS errors for the training set, the prediction set and the whole data set were 1.604, 2.478 and 1.817, and the correlation coefficients were 0.968, 0.947 and 0.962, respectively.
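
The abstract reports model quality throughout as MSE, RMS error, correlation coefficient (R) and classification accuracy on training, test and whole data sets. As a reading aid, the following minimal Python sketch shows how such figures are conventionally computed. It is not the dissertation's code: the function names are hypothetical, the sample values are invented toy numbers, and the pooling function assumes that the whole-set RMS is the size-weighted quadratic mean of the subset values, which the abstract does not state.

```python
import math


def mse(y_true, y_pred):
    """Mean square error: the average of the squared residuals."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rms(y_true, y_pred):
    """Root-mean-square error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient R between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("R is undefined when either series is constant")
    return cov / math.sqrt(var_t * var_p)


def accuracy(labels, predicted):
    """Fraction of samples whose predicted class equals the true class."""
    if len(labels) != len(predicted) or len(labels) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a == b for a, b in zip(labels, predicted)) / len(labels)


def pooled_rms(parts):
    """Whole-set RMS from (subset_size, subset_rms) pairs.

    Assumption: the subsets are disjoint and together form the whole set,
    so squared errors are pooled by weighting each subset by its size.
    """
    total = sum(n for n, _ in parts)
    return math.sqrt(sum(n * r ** 2 for n, r in parts) / total)


if __name__ == "__main__":
    # Invented toy values, for illustration only (not data from the thesis).
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.5, 2.0, 2.5, 4.0]
    print("MSE:", mse(observed, predicted))
    print("RMS:", rms(observed, predicted))
    print("R:  ", pearson_r(observed, predicted))
    print("Pooled RMS:", pooled_rms([(3, 0.5), (1, 1.5)]))
```

Under that pooling assumption the whole-set RMS is not the plain average of the training and test values, because errors are squared before averaging and the larger subset carries more weight. This fits the pattern in the abstract, where the overall figure usually lies closer to the training-set value.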

  • 【Online publication contributor】 Lanzhou University
  • 【Online publication year/issue】 2006, Issue 09