
化学信息学新算法及在化学、生物与食品科学中的应用研究

Application of Novel Cheminformatics Algorithms Studies in Chemistry, Biology and Food Science

【Author】 Du Hongying (杜红英)

【Supervisor】 Hu Zhide (胡之德)

【Author information】 Lanzhou University, Analytical Chemistry, 2009, Ph.D.

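【Abstract (Chinese, translated)】 In recent years, with the rapid development of information science, computer science and the Internet, a new interdisciplinary field, chemoinformatics, has grown rapidly. Chemoinformatics uses the methods of informatics to solve chemical problems and, in doing so, to uncover the underlying laws of chemistry. Its scope is broad and its content rich, covering for example chemical experimental design and optimization, quantitative calibration theory, analytical signal processing, chemical pattern recognition, model and parameter estimation, and artificial intelligence. Chemoinformatics arose from scientists' continuing need for knowledge of chemical regularities. The quantitative structure-property/activity relationship (QSPR/QSAR) is an important applied branch of chemoinformatics research. The method relates the structural parameters of compounds to their biological activity data through a mathematical model. QSPR/QSAR research was first applied in biology, where it developed to meet the need for rational design of bioactive molecules. With the development and application of computer technology, QSPR/QSAR research has reached a new level and is maturing; its range of application has expanded rapidly to biology, chemistry, pharmaceutical science, food science and many other disciplines. The hope is that a successful mathematical model will make it possible to understand, at the molecular level, the relationship between microscopic structure and macroscopic property or activity; to explore, from existing knowledge, the regularities linking the properties or activities of compounds to their structures; to infer the factors that give rise to particular properties; and then to provide information for designing, screening or predicting compounds with desired properties. The development of chemoinformatics has given the various branches of chemistry many new ideas and methods for solving problems. This dissertation examines several new algorithms in chemoinformatics research and applies them successfully to QSAR/QSPR studies. It consists of five chapters, as follows.

Chapter 1 briefly introduces the basic concepts and current research status of chemoinformatics and a variety of chemoinformatics algorithms, and describes in detail one branch of chemoinformatics research, QSAR, including its history, basic principles and implementation steps.

Chapter 2 discusses the application of the quantitative structure-retention relationship (QSRR) method to predicting the chromatographic retention behavior of peptides. (1) QSRR models were built with linear and nonlinear methods for the retention times of 101 peptides in reversed-phase liquid chromatography (RPLC). The best multi-linear regression (BMLR) method was used to select the molecular descriptors most closely related to retention behavior and to build a linear model. Two nonlinear regression methods, radial basis function neural networks (RBFNN) and projection pursuit regression (PPR), were used to build nonlinear models. For the training set, the RBFNN and PPR models gave correlation coefficients (R²) of 0.9787 and 0.9881 and root mean square errors (RMSE) of 0.5666 and 0.4207, respectively. The results indicate that RBFNN and PPR are simple and effective tools for proteomics research and may be applicable to other similar fields. (2) A novel chemoinformatics method, local lazy regression (LLR), was applied for the first time to predict the retention behavior of 278 peptides in immobilized metal-affinity chromatography (nickel column). BMLR, PPR and LLR were used to build linear and nonlinear QSRR models. The best LLR model gave R² values of 0.9446 and 0.9252 for the training and test sets, respectively. This work shows that the novel machine learning algorithm LLR is a very promising research tool that can be used in studies of chromatographic retention behavior and can help in the design, separation and purification of proteins and peptides.

Chapter 3 describes applications of QSAR in agricultural and food science. (1) Three machine learning methods, genetic algorithm-multiple linear regression (GA-MLR), least-squares support vector machine (LS-SVM) and PPR, were used to study the fungicidal activity of 100 thiazoline derivatives that inhibit rice blast. The linear GA-MLR model and the nonlinear LS-SVM and PPR models all gave good predictions, but the nonlinear models were more accurate. The results indicate that LS-SVM and PPR model the relationship between thiazoline molecular structure and fungicidal activity more accurately and are good modeling tools for studying rice blast inhibitors. The study also offers a new, simple and effective approach to the design and development of rice blast inhibitors and yields the molecular structural information most closely related to their activity. (2) The QSRR method was used to predict the SPME-GC-MS retention times of 43 aroma constituents of saffron. BMLR and PPR were used to build linear and nonlinear models, respectively, and both gave good results: the linear model had R² of 0.9434 (training set) and 0.8725 (test set), and the nonlinear model gave better predictions, with 0.9806 and 0.9456. Comparison of the stability and predictive ability of the models shows that the nonlinear PPR method is well suited to studying SPME-GC-MS retention behavior, and the work provides a simple and effective approach for separation studies of other plants and Chinese medicinal herbs.

Chapter 4 discusses applications of QSAR in life science and pharmaceutical research. (1) The QSRR method was used for linear and nonlinear modeling of the retention indices of 55 drugs in immobilized artificial membrane chromatography. Linear BMLR was used to select the parameters most relevant to the retention index and to build a linear regression model; with the selected descriptors, PPR and LLR were used to build more accurate predictive models. Comparison of the models showed that LLR, as a new modeling method, had very good predictive ability: for the training and test sets, the multiple correlation coefficient (R²) was 0.9540 and 0.9305, and the RMSE was 0.2418 and 0.3949. The results show that the new LLR method predicts well in QSRR studies and should be applicable to other similar chromatographic research. (2) Linear and nonlinear modeling methods were used to study the inhibitory activity of 80 N-hydroxy-α-phenylsulfonylacetamide derivatives (HPSAs) against three types of matrix metalloproteinase. Linear BMLR was used to select the key structural parameters and to build a linear model predicting the inhibitory activity of the compounds; PPR with a global grid search was then used with the selected parameters to build nonlinear regression models. Both the linear and nonlinear models gave satisfactory predictions. In this work the nonlinear PPR method was combined with grid search (GS) for the first time and successfully applied to modeling the inhibitory activity of HPSAs. The success of this approach offers a shortcut for optimizing and selecting the parameters of other models. (3) Linear regression and the nonlinear regression methods grid search-support vector machine (GS-SVM) and PPR were used to study affinity for the MT3 melatonin binding site. A genetic algorithm was used to select the structural parameters most relevant to the target and to build a linear regression model predicting affinity; with the five selected structural variables, GS-SVM and PPR were used to build more accurate models. Comparison of the models showed that nonlinear PPR predicts affinity for the MT3 melatonin binding site fairly accurately. The method provides a new approach for designing and developing new ligands of the MT3 melatonin binding site.

Chapter 5 covers QSAR prediction of the relative sensitivity of chemosensory systems. BMLR, SVM and LLR were used to build QSAR models of the relative sensitivity of odor detection thresholds (ODTs) and nasal pungency thresholds (NPTs) for 64 volatile organic compounds (VOCs); the predictions agreed closely with the experimental data. LLR gave the better predictive ability and is therefore an effective machine learning algorithm for QSAR research. The study also identified important molecular structural information closely related to the relative sensitivity of the VOCs. This information can be used to select or build new chemical sensors, and it also shows that LLR is a promising QSAR modeling method that could be used for similar chemical sensor modeling and prediction.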
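
The abstract compares models by R² and RMSE on training and test sets. As a reading aid, here is a minimal sketch of how these two statistics are commonly computed. It is an assumption that R² is taken as the coefficient of determination, 1 - SS_res/SS_tot; the dissertation may instead use the squared correlation coefficient. The function names and the sample numbers are illustrative only and are not data from the thesis.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(observed, predicted), 4))  # 0.98
print(round(rmse(observed, predicted), 4))       # 0.1581
```

Reporting both statistics separately for the training and test sets, as the abstract does, shows whether a model that fits the training data well also generalizes.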

【Abstract】 In recent years, with the development of information science, computer science and the Internet, a new interdisciplinary subject, chemoinformatics, has developed rapidly. Chemoinformatics uses informatics methods to solve chemical problems, find the essence of chemical phenomena, and explain the regularities hidden in large data sets. Its research area is very wide and its content rich, including chemical experiment design and optimization, analytical signal treatment, chemical pattern recognition, model and parameter estimation, and artificial intelligence. Chemoinformatics arose from scientists' continuing need for knowledge of chemical laws. The quantitative structure-property/activity relationship (QSPR/QSAR) study is an important applied branch of chemoinformatics. It establishes a quantitative relationship between the structural parameters of compounds and their biological activity. QSPR/QSAR was first applied in the biological field and developed in response to the need for rational design of bioactive molecules. Owing to the rapid development and extensive application of computer science, QSPR/QSAR studies have entered a new era and are now widely used in fields including biology, chemistry, medicinal science and food science.
Using different statistical methods, we expect to develop a successful theoretical model that can not only predict the properties of compounds that have not yet been synthesized but also identify and describe the molecular features relevant to variations in important molecular properties, give insight into the structural factors affecting those properties, and provide information for the functional design of molecules. The development of chemoinformatics provides novel, practical and constructive approaches for the progress of the branches of chemistry. This dissertation discusses some novel machine learning algorithms in chemoinformatics and applies them to QSAR/QSPR research. It consists of five chapters, described below.

Chapter 1 describes the principles of chemoinformatics and the current state of research, and introduces some novel algorithms. It also gives a brief description of one important applied branch of chemoinformatics, QSAR, including its history, basic theory and implementation steps.

Chapter 2 discusses the application of the quantitative structure-retention relationship (QSRR) method to predicting the chromatographic retention behavior of peptides. (1) QSRR models correlating the retention times of peptides in reversed-phase liquid chromatography (RPLC) with their structures were developed with linear and nonlinear modeling methods. The best multi-linear regression (BMLR) method was used to select the most appropriate molecular descriptors and to develop a linear QSRR model. Two nonlinear regression methods, radial basis function neural networks (RBFNN) and projection pursuit regression (PPR), were used to develop the nonlinear QSRR models.
For the training set, the coefficients of determination (R²) of RBFNN and PPR were 0.9787 and 0.9881, and the root mean square errors (RMSE) were 0.5666 and 0.4207, respectively. RBFNN and PPR should be useful in proteomics research and may be applicable to other similar fields. (2) A novel method, local lazy regression (LLR), was used for the first time to predict the retention behavior of peptides on a nickel column in immobilized metal-affinity chromatography (IMAC). BMLR, PPR and LLR were used to build linear and nonlinear QSRR models. The best model, LLR, gave R² values of 0.9446 and 0.9252 for the training and test sets, respectively. The comparison showed that the local learning method LLR is a very promising tool for QSRR studies. It could be applied to other chromatographic research and should facilitate the design and purification of peptides and proteins.

Chapter 3 describes the application of QSAR in agricultural and food science. (1) Three machine learning methods, genetic algorithm-multi-linear regression (GA-MLR), least-squares support vector machine (LS-SVM) and PPR, were used to investigate the relationship between the structures of thiazoline derivatives and their fungicidal activities against rice blast disease. Both the linear and nonlinear models gave good predictions, but the nonlinear models predicted better, which means that LS-SVM and PPR model the relationship between the structural descriptors and fungicidal activity more accurately. The results show that the nonlinear methods (LS-SVM and PPR) are good modeling tools for the study of rice blast. Moreover, this study provides a new, simple and efficient approach that should facilitate the design and development of new compounds against rice blast disease.
(2) QSRR studies were performed to predict the retention times of 43 constituents of saffron aroma, analyzed by solid-phase microextraction gas chromatography-mass spectrometry (SPME-GC-MS). Linear and nonlinear QSRR models were constructed with the BMLR and PPR methods. The predictions of both approaches agreed with the experimental data. The R² values of the best model (PPR) were 0.9806 (training set) and 0.9456 (test set). This study also affords a simple and efficient approach for studying the retention behavior of constituents of other similar plants and herbs.

Chapter 4 describes the application of QSAR in life science and medical research. (1) The relationship between the logarithm of the retention indices (log k_IAM) of 55 diverse drugs in immobilized artificial membrane (IAM) chromatography and their molecular structural descriptors was established by a linear method and by the nonlinear methods PPR and LLR. BMLR was used to select the most important molecular descriptors and to develop a linear QSRR model. With the selected descriptors, the two nonlinear regression methods, PPR and LLR, were used to build more accurate models. Of these methods, the LLR model gave the best predictions, with R² of 0.9540 and 0.9305 and RMSE of 0.2418 and 0.3949 for the training and test sets, respectively. The results also show that LLR is a promising method for QSRR modeling and could be used in other similar chromatographic research. (2) QSAR models of the inhibition of three matrix metalloproteinases (MMP-1, MMP-9, MMP-13) by a series of N-hydroxy-α-phenylsulfonylacetamide derivatives (HPSAs) were developed with linear and nonlinear modeling approaches. BMLR was used to develop the linear QSAR models. PPR with a global grid search was used for the first time to generate nonlinear QSAR models of MMP inhibition.
Both the linear and nonlinear models provided promising predictions. Six models were built for the different MMPs and their inhibitory activities (log(10⁶/IC₅₀)). The combination of PPR and global grid search proved to be a very useful modeling approach for predicting MMP inhibitory activities, and global grid search can also be used in other parameter optimization work. (3) Linear regression and the nonlinear regression methods grid search-support vector machine (GS-SVM) and PPR were used to develop QSAR models for a series of naphthalene, benzofuran and indole derivatives with respect to their affinities for the MT3/quinone reductase 2 (QR2) melatonin binding site. Five molecular descriptors selected by a genetic algorithm (GA) were used as the input variables for the linear regression model and the two nonlinear regression approaches. Comparison of the three methods indicated that PPR was the most accurate in predicting affinity for the MT3/QR2 melatonin binding site. It should also facilitate the design and development of new selective MT3/QR2 ligands.

Chapter 5 describes the application of QSAR in chemosensory systems research. QSAR models were developed to predict the relative sensitivities, odor detection thresholds (ODTs) and nasal pungency thresholds (NPTs), of the olfactory and nasal trigeminal chemosensory systems to a set of volatile organic compounds (VOCs). BMLR, SVM and LLR were used to build regression models. Of the three methods, the LLR model gave the best results. The investigation also identified important structural information strongly correlated with the relative sensitivities of these VOCs. Such information can be used to select and manufacture chemical sensors in the future.
The LLR method is a promising approach for QSAR modeling and could also be used to model other similar chemical sensors.
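
Local lazy regression is named throughout the abstract but not described in it. The sketch below illustrates the general idea as it is usually understood: for each query compound, find its k nearest training compounds in descriptor space, fit a small linear model on those neighbours only, and predict from that local model. Everything here is an assumption for illustration: the Euclidean distance, the value of k, the tiny ridge term added for numerical stability, and the names `solve`, `fit_linear` and `llr_predict`. The dissertation's own implementation may differ in all of these.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_linear(xs, ys, ridge=1e-8):
    """Least-squares fit y ~ b0 + b . x via the normal equations.

    The small ridge term is an assumption added to keep the system solvable
    when the neighbours are nearly collinear.
    """
    rows = [[1.0] + list(x) for x in xs]
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(ata, aty)


def llr_predict(train_x, train_y, query, k=5):
    """Predict one query from a linear model fitted on its k nearest neighbours."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    coef = fit_linear([train_x[i] for i in nearest],
                      [train_y[i] for i in nearest])
    return coef[0] + sum(c * q for c, q in zip(coef[1:], query))


# Made-up one-descriptor data with a curved response, for illustration only.
train_x = [[float(v)] for v in range(10)]
train_y = [x[0] ** 2 for x in train_x]
print(llr_predict(train_x, train_y, [4.5], k=4))
```

Because a separate model is fitted around each query, a locally linear fit can follow a globally nonlinear structure-property relationship, which is consistent with LLR being grouped with the nonlinear methods in the abstract.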

  • 【Online publication contributor】 Lanzhou University
  • 【Online publication year and issue】 2009, Issue 11