Computer-Aided Prediction of Properties of Drugs and Proteins (计算机辅助药物和蛋白性质预测研究)

【Author】 Lili Xi (席莉莉)

【Supervisor】 Xiaojun Yao (姚小军)

【Author information】 Lanzhou University, Chemoinformatics, 2010, PhD

【Abstract】 In recent years, the development of combinatorial chemistry and high-throughput screening has produced vast amounts of chemical, biological and pharmaceutical data. However, the structures of small molecules and the sequences and structures of biomacromolecules are obtained much faster than the corresponding property and function data, which to some extent hinders research. Computer-aided methods offer a feasible and effective way forward. This dissertation uses computer-aided methods to predict the properties of proteins and small-molecule drugs, and to predict ligand-protein interaction modes and the related bioactivities. It has three aims. The first is to build accurate and fast predictive models from known data in order to predict the properties of unknown samples. The second is to interpret the resulting models so as to reveal, to some extent, the key factors that influence the studied properties and thereby provide useful information for optimizing samples. The third is for the approach and the models to be of practical use: helping to screen for molecules that meet researchers' requirements, saving experimental cost, speeding up screening and shortening experimental cycles.

Chapter 1 outlines the basic principles of computer-aided property prediction. It describes in detail the acquisition and pre-processing of data, the characterization of the studied samples, and the development, assessment and validation of models. It also introduces the principle of molecular docking, the computer-aided method for predicting ligand-protein binding modes, and presents the algorithms used in this dissertation.

Chapter 2 applies computer-aided property prediction to proteins. It addresses two basic aspects of the protein folding process: quantitative prediction of folding rates and recognition of the type of folding pathway. In the first study, working entirely from sequence information, amino acid sequence autocorrelation (AASA) was used to characterize 101 protein sequences. Based on the key features selected by a genetic algorithm (GA), a global model (multiple linear regression, MLR) and a local model (local lazy regression, LLR) were built to predict protein folding rates. The local model predicted better than the global model, and three-fold, five-fold and ten-fold cross-validation showed that the local model had good predictive ability and stability. We also analyzed the key features that influence folding rates: unfolding entropy change, hydrophobicity, secondary structure preference and residue flexibility. In the second study, the 101 protein sequences were again characterized by AASA, working entirely from sequence information. Support vector machine-recursive feature elimination (SVM-RFE) was used to rank all calculated features by importance according to their support vector weights. Guided by leave-one-out (LOO) cross-validation, a least squares support vector machine (LS-SVM) classification model was built from the seven top-ranked features. Its accuracy was 91.09% and its Matthews correlation coefficient (MCC) was 80.88%. Three-fold, five-fold and ten-fold cross-validation also indicated that the model was stable and predictive. In addition, we analyzed how amino acid properties such as unfolding free energy, hydrophobicity, secondary structure distribution and charge distribution affect the type of folding pathway.

Chapter 3 applies computer-aided methods to predict the interaction mode and interaction strength between ligand and protein. In the first study, a combined molecular modeling approach, taking the perspectives of the protein, the ligand and the protein-ligand complex, was used to analyze the structure-activity relationships and binding modes of 58 inhibitors of the gelatinases MMP-2 and MMP-9. (1) Protein perspective: sequence alignment and structure superimposition give a better understanding of the active sites of the proteins. (2) Inhibitor perspective: the QSAR study can accurately predict the inhibitory activity of the small molecules and identify the key structural features that influence activity. (3) Protein-ligand complex perspective: molecular docking identifies the key residues and gives a better understanding of the critical protein-ligand interactions. This multi-perspective strategy provides a great deal of important information and suggests an approach for designing new MMP inhibitors. In the second study, a series of new MMP-13 inhibitors was used to examine two important issues in QSAR studies: the selection of the active conformation and the characterization of samples by descriptors. Because the structure of the MMP-13 receptor is known, all studied compounds were docked into the active site of MMP-13 with the accurate molecular docking program Glide to obtain the active conformation of each compound. For characterization, structural descriptors of the ligands, descriptors related to ADME properties, and descriptors of the ligand-protein interaction were used. A genetic algorithm selected the key descriptors that influence inhibitory activity while an MLR model (the global model) was built, and both internal and external validation showed that this model was stable and predictive. Considering the advantages of local models, an LLR model was also built. Compared with the global model, the local model significantly improved predictive ability.

Chapter 4 applies computer-aided property prediction to ADME/Tox-related properties of drug-like molecules. In the first study, CYP2C19 was chosen as the research object. Based on 7750 structurally diverse compounds, random forest (RF) was used to build a classification model that recognizes substrates of CYP2C19. From the 6200 compounds of the training set, RF selected 19 important descriptors and built the classification model, which was then used to predict the 1550 compounds of the external test set. On the external test set the accuracy was 93.42% and the MCC was 80.36%. The RF model runs quickly and recognizes substrates accurately, so it can be used in the early stage of drug discovery to identify CYP2C19 substrates. At the theoretical level it can provide useful information to researchers designing drug molecules, reduce the probability of metabolism-mediated drug-drug interactions, and improve the effectiveness and safety of drugs. In the second study, based on 947 structurally diverse compounds, SVM-RFE was used to rank the calculated descriptors by importance according to their support vector weights, and LS-SVM was used to build a classification model that recognizes whether a compound can cause drug-induced liver injury. From the 710 compounds of the training set, and guided by LOO cross-validation, the LS-SVM classification model was built from the fifteen top-ranked descriptors, with an accuracy of 76.48%. For the 237 compounds of the external test set, the prediction accuracy was 70.04%. The model can be used to judge whether a compound can cause human hepatocyte toxicity, and it is particularly accurate at recognizing compounds that do cause liver toxicity. This indicates that computational methods are an effective prediction tool that can be applied to many other ADME/Tox-related properties, can provide useful information in the early stage of drug discovery, and may improve screening speed to some extent.
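
The classification results in the abstract are reported as accuracy and Matthews correlation coefficient (MCC). As a reading aid, the minimal sketch below shows how these two metrics are conventionally computed from a binary confusion matrix. It is not the code used in the dissertation: the function names and the toy labels are hypothetical, and the abstract quotes MCC as a percentage (for example 80.88%), which on the usual scale of -1 to 1 corresponds to 0.8088.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is set to 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy labels, for illustration only (not data from the thesis).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("MCC:", matthews_corrcoef(*counts))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the two classes are imbalanced, which is a common reason to report both.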

  • 【Online publication contributor】 Lanzhou University
  • 【Online publication year and issue】 2010, Issue 10