Studies on a Few Key Problems of QSAR/QSPR Modeling Based on the OECD Principles
【Author】 Chen Xian (陈宪);
【Supervisor】 Liang Yizeng (梁逸曾);
【Author information】 Central South University (中南大学), Applied Chemistry, 2013, PhD
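【Abstract (Chinese version)】 Abstract: Following the requirements of the Organisation for Economic Co-operation and Development (OECD) principles, this dissertation studies several key problems in quantitative structure-activity/property relationship (QSAR/QSPR) modeling. In addition, it presents a preliminary exploration of bioactivity annotation for large-scale molecular structure databases.

Chapter 1 first describes the OECD principles and their importance as guidance for QSAR/QSPR research. Then, based on the requirements of the OECD principles, it identifies several key problems that need to be studied in QSAR/QSPR modeling: methods for improving the accuracy and robustness of QSAR/QSPR models, methods for defining the model applicability domain, and model interpretation.

Chapter 2 studies methods for improving the accuracy and robustness of linear QSAR/QSPR models. We propose a new modeling method that improves on uncorrelated linear discriminant analysis (ULDA), together with a new variable selection method. We apply the new modeling method combined with the variable selection method (MULDA-RFE) to QSAR/QSPR modeling and prediction on five data sets of ADMET-related properties and one data set of factor Xa inhibition. The results show that the new method improves both predictive accuracy and robustness relative to the original algorithm. Compared with a series of linear and nonlinear models from the literature, the new method gives predictions that are better than or comparable to those models, which indicates that it is an effective QSAR/QSPR modeling method. Moreover, MULDA-RFE is a linear modeling method, so it has advantages in algorithmic unambiguity and model interpretability.

Chapter 3 takes the retention indices of flavor compounds on stationary phases of different polarity as the QSAR/QSPR modeling target. It studies how to improve the predictive accuracy and robustness of partial least squares (PLS) linear models, and gives a preliminary analysis of the main structural features that affect the retention behavior of flavor compounds on stationary phases of different polarity. The conclusions are as follows. Introducing the Monte Carlo (MC) method for outlier detection and the random frog method for variable selection greatly reduces the standard deviation error of prediction (SDEP) of the model, and both R² and Q² increase considerably. This shows that outlier detection and variable selection greatly improve the predictive accuracy and robustness of the model. The statistical distribution of resampling prediction errors further confirms the effectiveness of the proposed QSAR/QSPR modeling procedure.

Chapter 4 gives a fairly comprehensive study and discussion of the accuracy and robustness of QSAR/QSPR models, methods for defining the applicability domain, and model interpretation. The modeling targets are four important sets of bioactivity and toxicity data. In the study of accuracy and robustness, we compare several representative descriptors and modeling methods. The results show that molecular fingerprint descriptors such as MACCS and PubChem, when combined with a suitable modeling method, give accuracy and robustness comparable to the computed Dragon structural descriptors; among the modeling methods, support vector machines (SVM) and random forest (RF) stand out in accuracy and robustness. In the study of applicability domain definition, we propose a new method based on model predictive probability and compare it with the commonly used method based on molecular structural similarity. The results show that the proposed method outperforms the structural-similarity method; furthermore, of the two probability-based methods, Prob-SVM is slightly better than Prob-RF. In the study of model interpretation, we analyze and interpret the structure-activity relationships of each model using the important molecular descriptors obtained through variable selection. The results show that a suitable variable selection method greatly facilitates model interpretation, that using molecular fingerprints as structural descriptors makes it possible to mine structural information related to molecular activity more directly, and that substructure-type descriptors play an important role in predicting many kinds of activity.

In Chapter 5, we carry out a preliminary exploration of bioactivity annotation for large-scale molecular structure databases. We mainly use the PASS program to annotate the bioactivities of nearly one million compounds; we then check and compare the annotation results through structural similarity searches; in addition, we mine useful information reflected in the annotations, such as biochemotypes, i.e. privileged scaffolds. Based on this work, we reach the following preliminary conclusions and outlook. We argue for the importance of bioactivity annotation. However, our preliminary results from annotating a large-scale database show that such annotation is a great challenge with much room for improvement: further research is needed on the accuracy of annotation, the non-black-box nature of bioactivity annotation, the balance between annotation efficiency and accuracy, and the ontological definition of bioactivities and biochemotypes.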
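For reference, the SDEP and Q² statistics cited for Chapter 3 are commonly defined from cross-validated predictions as shown below. The abstract does not state the exact definitions used in the dissertation, so the symbols here are assumptions: $y_i$ is the observed retention index of compound $i$, $\hat{y}_{(i)}$ is its prediction from a model fitted without compound $i$, $\bar{y}$ is the mean of the observed values, and $n$ is the number of compounds.

```latex
\mathrm{PRESS} = \sum_{i=1}^{n} \bigl( y_i - \hat{y}_{(i)} \bigr)^2,
\qquad
\mathrm{SDEP} = \sqrt{\frac{\mathrm{PRESS}}{n}},
\qquad
Q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{n} \bigl( y_i - \bar{y} \bigr)^2}
```

Under these definitions a smaller PRESS lowers SDEP and raises Q² at the same time, which matches the pattern reported for Chapter 3.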
【Abstract】 ABSTRACT: This dissertation studies a few key problems in QSAR/QSPR (Quantitative Structure-Activity/Property Relationship) modeling according to the requirements of the OECD (Organisation for Economic Co-operation and Development) principles. In addition, a study toward automated bioactivity annotation of large compound libraries is carried out.

In Chapter 1, we discuss the importance of the OECD principles for QSAR/QSPR model validation. Based on the five OECD principles, we identify several key problems in QSAR/QSPR modeling that need to be studied: how to improve the accuracy and robustness of QSAR/QSPR models, how to define their applicability domain, and how to interpret them.

In Chapter 2, we study methods for improving the accuracy and robustness of QSAR/QSPR models. We propose an M-ULDA (Modified Uncorrelated Linear Discriminant Analysis) algorithm coupled with the RFE (Recursive Feature Elimination) method for feature selection as a powerful QSAR modeling method. QSAR studies on six data sets, related to ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) properties and to factor Xa inhibition, were used to evaluate the performance of the new method. The accuracy and robustness results indicate that the new method is superior to the original method, and comparison with other linear and nonlinear QSAR/QSPR methods shows that it provides comparable or better predictive accuracy. In addition, the new modeling method is easier to interpret than the nonlinear methods.

In Chapter 3, the study focuses on improving the accuracy and robustness of the PLS (Partial Least Squares) model. We introduce the MC (Monte Carlo) outlier detection method and the random frog variable selection method, both recently developed by our laboratory, into QSAR models that predict the retention indices of 237 flavor compounds on four stationary phases of different polarity. The important structural features related to the retention behavior of the flavor compounds on these stationary phases are also explored. The SDEP (Standard Deviation Error of Prediction) and Q² results show that the accuracy and robustness of the PLS model can be significantly improved by using these methods for outlier detection and variable selection. This conclusion is further confirmed by the results of a Monte Carlo test.

In Chapter 4, a comprehensive study of the accuracy of QSAR/QSPR models, their applicability domain, and their interpretation is carried out. Four sets of important bioactivity and toxicity data were used for the QSAR/QSPR study. For the study of accuracy and robustness, we compared the performance of different types of molecular descriptors and modeling methods. The results indicate that fingerprint-type molecular descriptors such as MACCS and PubChem did not reduce the accuracy and robustness of QSAR/QSPR models compared with the theoretical Dragon descriptors. Among the modeling methods studied in this chapter, SVM (Support Vector Machine) and RF (Random Forest) are superior in the accuracy and stability of their predictions. For the applicability domain of QSAR/QSPR models, we propose a novel definition method based on predictive probability and compare it with a commonly used method based on molecular similarity. The assessment results indicate that the new method is superior to the similarity-based method, so it seems reasonable to define the applicability domain of QSAR/QSPR models using the new method. Furthermore, we found that the method based on SVM probability is better than the one based on RF probability. For the study of model interpretation, we focused mainly on the effect of variable selection and the use of molecular fingerprints. We conclude that both are very helpful for model interpretation, since they can reveal the important substructures related to the activity or property.

Chapter 5 describes a process to automatically annotate biochemotypes of compounds in a library and thus to identify bioactivity-related chemotypes (biochemotypes) from a large library of compounds. The process consists of two steps: (1) predicting all possible bioactivities for each compound in a library, and (2) deriving possible biochemotypes based on the predictions. A commercially available compound library (CACL) of about one million (982,889) compounds was tested using this process. This chapter demonstrates the importance and feasibility of automatically annotating biochemotypes for large libraries of compounds. Moreover, we suggest ways in which systematic bioactivity prediction programs should be improved. First, a balance must be found between the speed of automated bioactivity annotation and data quality: annotation with the PASS program is very fast, but it is equally important that accuracy not be sacrificed. Second, an ideal systematic bioactivity prediction tool must indicate privileged structures and be trainable by users. Third, the definition of bioactivities (biochemotype ontology) needs to be better developed in the future.
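The two applicability-domain strategies compared in Chapter 4 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the thresholds (0.5 and 0.7), and the toy fingerprints are all assumptions made for illustration. A fingerprint is represented as the set of its on-bit indices, and `p_active` stands for a class probability that would come from a classifier such as an SVM or a random forest.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def in_domain_by_similarity(query: Set[int],
                            training: Iterable[Set[int]],
                            min_similarity: float = 0.5) -> bool:
    """Similarity-based domain: the nearest training compound must be similar enough.

    min_similarity is a hypothetical threshold, not a value from the dissertation.
    """
    nearest = max((tanimoto(query, fp) for fp in training), default=0.0)
    return nearest >= min_similarity


def in_domain_by_probability(p_active: float,
                             min_confidence: float = 0.7) -> bool:
    """Probability-based domain: the model must be confident in one of the two classes.

    min_confidence is a hypothetical threshold, not a value from the dissertation.
    """
    return max(p_active, 1.0 - p_active) >= min_confidence


# Toy data (invented): three training fingerprints and one query compound.
training_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}, {1, 4, 6, 7}]
query_fp = {1, 2, 3, 9}

print(in_domain_by_similarity(query_fp, training_fps))
print(in_domain_by_probability(0.55))
print(in_domain_by_probability(0.92))
```

The similarity rule depends only on the structures in the training set, whereas the probability rule depends on what the fitted model predicts for the query compound, so the two rules can disagree on the same compound.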