New Strategies to Improve the Predictivity of QSAR Models and Its Application in Medicinal Chemistry

【Author】 Li Jiazhong (李加忠)

【Advisor】 Yao Xiaojun (姚小军)

【Author information】 Lanzhou University, Analytical Chemistry, 2009, PhD

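【Abstract, translated from the Chinese】 Science develops to serve real life. People often judge something new by asking what practical use it has, how it relates to real life, or whether it can solve real problems. Quantitative structure-activity relationship (QSAR) research is no exception, and its usefulness in solving practical problems has always drawn attention. QSAR is now very widely applied: its objects of study include the biological activity, toxicity, pharmacokinetic parameters and bioavailability of compounds, as well as various physicochemical properties and environmental behaviors of molecules, and its fields of application span biology, pharmacy, chemistry, environmental science and other disciplines. Researchers hope that QSAR models will help them understand, at the molecular level, the relationship between a compound's microscopic structure and its macroscopic activity, and so provide information for designing, screening or predicting compounds with desired properties. Among the many applications, predicting the activity of new compounds that have not been measured experimentally, or not even synthesized, is one of the most important uses of a QSAR model. To be used for predicting new compounds, however, a QSAR model must have high and trustworthy external predictive ability. This dissertation therefore considers each step of building a QSAR model and attempts to resolve some issues in current QSAR research that remain to be improved. It focuses on the choice of low-energy conformations in QSAR studies, proposes several new modeling strategies, and introduces new modeling methods, with the aim of improving the reliability and external predictive ability of QSAR models as far as possible. QSAR models with good predictive ability are then used to design and screen active compounds.
Chapter 1 gives an overview of QSAR research. It describes in detail the history, current status and trends of QSAR, and the building, validation and application of models, with particular attention to model validation. To give a clear picture of QSAR modeling methods, the chapter also classifies and summarizes them from several perspectives.
Chapter 2 discusses a basic question in 2D QSAR research: the influence of compound conformation on the QSAR model. The aim is to analyze how the low-energy conformations obtained by different energy optimization methods differ and how much they affect the final QSAR model. Which conformation the final model is built on is critically important and is a foundation of any QSAR study. From a study of three groups of compounds of differing complexity, the main conclusions were: (1) the initial conformation used for 3D structure optimization can affect the final result of the model, and the more complex the molecular structure, the greater the effect; (2) conformational searching yields lower-energy molecular states and helps molecular mechanics or semi-empirical optimization methods find the global lowest-energy conformation quickly and easily; (3) if the QSAR model is to be used to predict new compounds, the new compounds should preferably be optimized with the same method as the training set.
Chapter 3 presents two novel consensus modeling approaches proposed in this work: WCM and an improved CDFS. Consensus modeling is a recent modeling approach, but the methods currently used to build consensus models all rely on an averaging strategy (ACM). In practice, different submodels contain different information and contribute differently to the final activity. This work therefore proposes a more reasonable weighting strategy (WCM), which uses multiple linear regression to assign different weights to the submodels, together with a Q²-guided submodel selection strategy (QGMS) to guide the choice of submodels. The two strategies were applied to a QSAR study of a series of malonyl-CoA decarboxylase inhibitors. The WCM model outperformed both ACM and the best single model: fitting ability and predictive ability improved considerably, and the model was more stable, more reliable and more interpretable. CDFS is another consensus modeling idea. It splits the data set many times, builds a model for each split, and then builds the final model from the descriptors common to those models. Its drawback is that the representativeness of the resulting training sets is hard to guarantee. This work proposes obtaining a representative training set with a rational splitting method, building models on that set with different descriptor combinations (the more often a descriptor appears, the more important the structural information it carries), and then building the final model from the most frequent descriptors. The method was applied to a QSAR study of 169 thiazole-based lymphocyte-specific kinase inhibitors and gave a final model containing eight common descriptors, with good results.
Chapter 4 points out a problem in the local modeling method local lazy regression (LLR) and proposes a solution. In local modeling, determining the optimal number of neighbors (k) is critical to the model's predictions. The current practice is to select it automatically from the Q² of leave-one-out cross-validation (LOO-CV). LOO-CV, however, is only an internal validation technique and says nothing about a model's external predictive ability, so the reliability of predictions from a model built this way is questionable. This work proposes monitoring the external predictive ability of the local models to improve the reliability and accuracy of LLR predictions. Applied to a QSAR study of melanin-concentrating hormone receptor 1 antagonists, the approach improved both the predictive ability of the model and the confidence in its predictions.
Chapter 5 applies two newer nonlinear modeling methods, least squares support vector machines (LS-SVMs) and gene expression programming (GEP), which improved both the fitting ability and the predictive ability of the models to some extent. In this dissertation: (1) LS-SVMs were used to classify oxindole-based cyclin-dependent kinase (CDK) inhibitors, with classification accuracy much higher than that of a linear discriminant analysis (LDA) model; (2) LS-SVMs were applied to 44 human liver glycogen phosphorylase a (hlGPa) inhibitors; leave-one-out cross-validation showed the LS-SVMs model to be more stable, the nonlinear model predicted better than the multiple linear regression (MLR) model, and the work confirmed the need for descriptor selection in QSAR studies; (3) LS-SVMs were used in a QSAR study of pyrazine-pyridine vascular endothelial growth factor receptor 2 (VEGFR-2) inhibitors, with predictive ability considerably higher than that of the linear MLR model; (4) the nonlinear GEP method was used in a QSAR study of 62 MCHR1 antagonists; the fitting ability and especially the external predictive ability of the GEP model were much better than those of the linear MLR method, with R²ext rising from 0.756 for the linear model to 0.819.
Chapter 6 focuses on applying models: database mining and virtual screening. A novel hybrid QSAR/docking strategy is proposed for the QSAR study of inhibitors of the lymphocyte-specific kinase Lck, and the resulting model was used for virtual screening of a compound database. Two sulfonylurea derivatives were finally identified; their binding modes at the Lck kinase active site closely resemble those of known inhibitors reported in the literature, and they have high predicted activity. The key sulfonylurea and hydrophobic substructures can serve as a lead scaffold for optimizing Lck inhibitor structures. The proposed strategy takes the structural features of the training data into account from several angles, ensures the diversity of the training set, and successfully combines ligand-based virtual screening (LBVS) with receptor-based (structure-based) virtual screening (SBVS) for screening compound databases.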

【Abstract】 The development of science comes from the needs of daily life. When people talk about a new thing, they always ask questions like: is it useful, or can it really solve some problems? Quantitative structure-activity relationship (QSAR) methodology is in this situation now. In fact, the QSAR concept arose with the development of medicinal chemistry, so it has been useful from the beginning. Up to now, QSAR has been widely used in biology, chemistry, medicinal chemistry, environmental science and other fields, and the endpoints include bioactivity, toxicity, pharmacokinetics (ADME), molecular properties and some environment-related endpoints. Researchers try to understand the relationship between microscopic molecular structure and macroscopic behavior, and to find the important structural information related to the corresponding endpoints, in order to facilitate the design or screening of compounds with desired activities. Mostly, QSAR models are used to predict the corresponding endpoints for unmeasured or unseen new compounds. But if we want to make predictions for new compounds, the QSAR models used must be rigorously validated, and high and reliable external predictivity is essential. This dissertation therefore aimed to improve reliability and predictive ability by considering each step of the whole modeling process, and tried to address some aspects of QSAR methodology that need improvement. We discussed the influence of molecular conformations optimized by different methods on the quality of QSAR models, proposed several novel modeling strategies, and used two novel nonlinear modeling methods to build QSAR models. Furthermore, we proposed a new hybrid QSAR/docking approach for virtual screening of chemical databases to find novel pan-Src Lck inhibitors.
In this dissertation, a brief description of QSAR was given in Chapter 1, including the history, principle and research status of QSAR studies. We discussed the process of building a stable, reliable and predictive model, with an emphasis on model validation techniques. Furthermore, to understand the different modeling methods clearly, we classified the methods from different views. Additionally, the trends and several novel ideas in the QSAR area were summarized.
In Chapter 2, we discussed a basic problem in QSAR modeling, the lowest-energy conformation used to build the model, aiming to analyze the influence of molecular conformations optimized by different methods on the quality of QSAR models. We used three data sets of different structural complexity: SMF, LckI and NS5BI. Comparing the results, we drew the following conclusions: (1) the original input conformations are very important in the structure optimization task and may influence the quality of the QSAR model, especially for molecules with much flexibility; (2) conformational searching, which aims to find a better starting conformation near the low-energy conformation, may play an important role in the optimization process; (3) new samples in the test set should go through the same optimization process as the training samples if we want to predict the corresponding endpoint accurately.
In Chapter 3, we discussed two new consensus modeling strategies that we proposed. Consensus modeling, which uses several submodels to make a prediction for a new compound, is a novel strategy in QSAR research. In all the published consensus models, the final prediction for a sample is obtained as a simple average of the results predicted by all the submodels (average consensus modeling, ACM). However, it may be more reasonable to give each submodel a different weight (weighted consensus modeling, WCM). So in this work, to give a reasonable weight to every submodel, the results predicted by all the submodels served as variables, and multiple linear regression (MLR) was used to give them different weights. Furthermore, we proposed Q²-guided model selection (QGMS) to guide the selection of submodels. The results indicated that the WCM consensus model based on the QGMS submodel set could give the highest fitting ability and external predictivity.
Combined data splitting-feature selection (CDFS) is also a kind of consensus modeling method. With CDFS, data splitting is performed many times, and feature selection is performed in each case. The resulting models are then compared, and the final model is the one whose descriptors are the variables common to all of the resulting models. The shortcoming of CDFS is that it is hard to be sure each training set spans the whole descriptor space so as to represent the studied data set. We proposed a new strategy to build this kind of final model in a different way. First, we obtained a training set using a rational data splitting method. Then a population of models was established by GA-MLR using the training set data only. Descriptors with higher frequency were considered key structural features related to the inhibitory activity, so they were extracted to build the final QSAR model. This strategy was used to analyze 169 aminothiazole-based Lck inhibitors, and the results were satisfactory.
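
The difference between average and weighted consensus modeling described for Chapter 3 can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the inclusion of an intercept term, and the toy numbers are all assumptions made for illustration, and the regression is solved with plain normal equations using only the Python standard library.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_wcm_weights(submodel_preds, y):
    """Fit MLR weights (intercept first), using submodel predictions as variables."""
    rows = [[1.0] + list(p) for p in submodel_preds]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_wcm(weights, preds):
    """Weighted consensus: intercept plus weighted sum of submodel predictions."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], preds))


def predict_acm(preds):
    """Average consensus: simple mean of submodel predictions."""
    return sum(preds) / len(preds)


# Hypothetical toy data: each row holds two submodels' predictions for one compound.
train_preds = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
train_y = [1.7, 2.3, 3.9, 4.3]  # made-up activities for illustration only

weights = fit_wcm_weights(train_preds, train_y)
print("WCM weights:", weights)
print("ACM prediction:", predict_acm([2.0, 4.0]))
print("WCM prediction:", predict_wcm(weights, [2.0, 4.0]))
```

In this sketch ACM treats both submodels equally, while WCM lets the regression decide how much each submodel contributes. In practice the weights would be fitted on training set predictions only and checked on an external test set.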
In Chapter 4, we pointed out a self-contradictory problem in local QSAR prediction and proposed a solution to it. The commonly used local method is local lazy regression (LLR). It has been shown that any improvement in prediction from LLR depends on the nature of the neighborhood obtained for a given query point. In LLR, the leave-one-out cross-validation (LOO-CV) procedure is usually used to optimize the number of neighbors (k), and the model giving the lowest LOO-CV error or the highest LOO-CV correlation coefficient is chosen as the best model for prediction. However, LOO-CV is just an internal validation technique, and a good LOO-CV statistic is a necessary but not a sufficient condition for a model to have high predictive power. So we proposed a new strategy to improve the predictive ability of LLR models and to assess the accuracy of a query prediction: the bandwidth of LLR (the number of neighbors k) is optimized by considering the predictive ability of the local models on an external validation set. This approach was applied to the QSAR study of a series of melanin-concentrating hormone receptor 1 (MCHR1) antagonists. The results from the new strategy show evident improvement over the commonly used LOO-CV-based local lazy regression and the traditional global linear model.
In Chapter 5, we used two novel nonlinear methods to build QSAR models: least squares support vector machines (LS-SVMs) and gene expression programming (GEP). The LS-SVMs method was used to analyze the structure-activity relationship (SAR) of a series of oxindole-based cyclin-dependent kinase (CDK) inhibitors, and the LS-SVMs classifier assigned the test set samples to the right class more accurately than the linear discriminant analysis (LDA) classifier. The LS-SVMs method was then used to build QSAR models for 44 human liver glycogen phosphorylase a (hlGPa) inhibitors and 32 pyrazine-pyridine-based vascular endothelial growth factor receptor 2 (VEGFR-2) inhibitors. The nonlinear models obtained performed much better than the linear MLR models. Finally, the nonlinear GEP method was used to analyze the quantitative structure-activity relationship of 62 MCHR1 antagonists. The fitting ability and external predictivity of the GEP model were both better than those of the MLR model; in particular, the R²ext of 0.819 for the GEP model was much higher than that of the linear model (0.756).
In Chapter 6, we proposed a new hybrid QSAR/docking approach for virtual screening of chemical databases and used it to mine a drug database for novel pan-Src Lck inhibitors. As a result, two sulfonylurea derivatives were predicted in silico to be potential Lck inhibitors, which could bind to the active site of the target protein in a mode very similar to that of other reported inhibitors. The key sulfonylurea and hydrophobic substructures can be used as a lead scaffold for further Lck inhibitor design. The proposed strategy is a successful combined application of ligand-based (LBVS) and structure-based (SBVS) virtual screening, which can take into account all important aspects of the structural features of the training samples while guaranteeing the diversity of the training set. The results indicate that the proposed approach to chemical screening is of practical utility and can be used as a general tool to screen chemical databases and discover lead compounds.
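
The Chapter 4 idea of choosing the number of neighbors k by external predictive ability rather than by LOO-CV can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's procedure: it uses a single hypothetical descriptor, absolute difference as the distance, ordinary least squares on the k nearest training points as the local model, and RMSE on a held-out validation set as the selection criterion. All names and numbers are invented for the example.

```python
import math


def local_linear_predict(train, query_x, k):
    """Predict y at query_x from a line fitted to its k nearest training points."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - query_x))[:k]
    n = len(neighbors)
    mean_x = sum(x for x, _ in neighbors) / n
    mean_y = sum(y for _, y in neighbors) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in neighbors)
    if sxx == 0.0:
        return mean_y  # degenerate neighborhood: fall back to the local mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in neighbors) / sxx
    return mean_y + slope * (query_x - mean_x)


def validation_rmse(train, validation, k):
    """RMSE of the local models on an external validation set for a given k."""
    errors = [(local_linear_predict(train, x, k) - y) ** 2 for x, y in validation]
    return math.sqrt(sum(errors) / len(errors))


def select_k(train, validation, candidate_ks):
    """Pick the k whose local models predict the validation set best."""
    return min(candidate_ks, key=lambda k: validation_rmse(train, validation, k))


# Hypothetical toy data: (descriptor, activity) pairs on a curved relationship.
train = [(float(x), float(x * x)) for x in range(10)]
validation = [(2.5, 6.25), (6.5, 42.25)]

best_k = select_k(train, validation, [2, 4, 10])
print("selected k:", best_k)
```

Because the toy relationship is curved, small neighborhoods track it better than a global line, so the validation error favors a small k here. The compounds used to pick k should be kept separate from any final test set, so that the reported predictivity remains an external estimate.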

  • 【Online publication contributor】 Lanzhou University
  • 【Online publication year/issue】 2009, Issue 11