Glide打分函数中蛋白间打分噪音的发现和修正

The Discovery and Correction of Interprotein Scoring Noises in Glide Docking Scores

【Author】 Wang Wei (王玮)

【Advisor】 Chen Xin (陈新)

【Author information】 Zhejiang University, Bioinformatics, 2012, PhD

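【Abstract (translated from the Chinese)】 Very few small-molecule drugs are selective enough in their interactions with their target proteins. Binding of a drug to unintended off-target proteins often causes side effects, but occasionally gives rise to new therapeutic uses. Identifying a compound's off-target proteins is therefore important for evaluating its development potential. In computational biology, reverse docking can predict the target proteins of a compound. Reverse docking uses one compound (the bait) to virtually screen a protein library (the prey), the opposite of normal docking, in which a protein (the bait) is used to screen a small-molecule library (the prey). This study found that targeted optimization of the scoring function for reverse docking can improve the accuracy of target prediction. The standard dataset chosen, the Astex Diverse dataset, is a structurally diverse, high-quality dataset of 85 ligand-receptor protein complexes. GlideScore, the scoring function of the "standard precision" mode of the docking software Glide, accurately reproduced the crystal binding conformations of 58 ligand-receptor complexes in this dataset. In reverse docking against these 58 complexes, however, GlideScore correctly identified only 57% of the ligand-receptor relationships. The likely cause is that GlideScore scores certain proteins too high or too low, that is, GlideScore contains noise between different proteins. Analysis of successful and failed reverse docking cases showed that the protein property "Balance" is strongly associated with the reverse docking outcome. "Balance" is the ratio of the hydrophobic area to the hydrophilic area of the target protein's binding site. Introducing a correction term built around "Balance" improved target prediction accuracy by 27% (from 57% to 72%). The new scoring function, named BCGlideScore, also improved reverse docking accuracy by a similar margin, 29% (from 47% to 60%), on an additional test set of similar quality. Analysis showed that three properties of BCGlideScore are related to the improved accuracy: the added correction term reduces "interprotein" noise; the correlation between BCGlideScore and "Balance" is reduced once the correction term is added; and the correction term may represent a rough estimate of the change in protein entropy. "Extra precision" is another docking mode in Glide, in which the conformational search algorithm and the scoring function are optimized to better estimate ligand-receptor affinity. Using an analysis workflow similar to that for the "standard precision" mode, we found that XPEmodelScore, the scoring function giving the highest reverse docking accuracy in "extra precision" mode, also contains "interprotein" noise. However, because the types and number of candidate small-molecule and protein properties were limited, no property strongly associated with the interprotein noise in XPEmodelScore was found, and XPEmodelScore therefore could not be corrected. We propose that interaction fingerprint descriptors have great potential for correcting interprotein noise. In addition, a marked improvement in the correlation between a scoring function and affinity can improve the accuracy of both compound library screening and protein library screening. Nevertheless, our results show that although XPGlideScore correlates slightly better with affinity than the standard-mode GlideScore, it predicts targets poorly (only 22% correct) compared with GlideScore's 57% reverse docking accuracy, which indicates that a small improvement in the correlation with affinity does not necessarily translate into better protein library screening accuracy. The study also found that the three scoring objectives in molecular docking (predicting the optimal binding conformation, predicting small molecules that can bind a receptor protein, and predicting the target proteins of a small molecule) each emphasize different aspects of ligand-receptor binding. Developing dedicated scoring functions for each objective is therefore feasible and more effective. Although an "all-purpose" scoring function that satisfies all objectives exists, such a function usually requires a large amount of computation. Developing specialized functions for different scoring purposes lowers the precision required of each function and thus the computational cost. It may also be easier to prepare more comprehensive, more representative datasets to train and test more specialized scoring functions. Separating the scoring objectives may therefore be the key to developing simpler yet more effective scoring systems. This is the first report and correction of interprotein noise in scoring functions for reverse docking. We hope this study will prompt further investigation and normalization of similar interprotein noise in all scoring functions, so that reverse docking can ultimately predict the targets of small-molecule compounds more accurately. We will also continue working to develop scoring functions tailored to reverse docking.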

【Abstract】 Small-molecule drugs are rarely selective enough to interact solely with their designated targets. Unintended "off-target" interactions often lead to side effects, but occasionally also lead to new therapeutic uses. Identifying the off-targets of a compound is therefore of significant value in evaluating its development potential. In computational biology, the strategy of "reverse docking" has been introduced to predict the targets of a compound: a compound is used to virtually screen a library of proteins, reversing the roles of bait and prey in "normal" docking screens. The present study shows that, in reverse docking, additional optimization of the scoring function may help to improve target-prediction accuracy. As our standard dataset we chose the Astex Diverse dataset, a diverse, high-quality set of 85 ligand-protein complexes. GlideScore in the "standard precision" mode of Glide accurately reproduced the crystal binding conformations of 58 complexes in the Astex Diverse dataset. In the reverse docking of those 58 complexes, however, only 57% of the ligand-protein relationships were correctly identified. This was likely a result of consistent over- or under-estimation of the GlideScores for specific proteins; in other words, there was interprotein noise in the GlideScores. Using a decision tree to classify the successful and unsuccessful reverse docking cases, we found that a protein descriptor, balance, was strongly associated with successful or unsuccessful target prediction. The balance descriptor expresses the ratio of the hydrophobic to the hydrophilic character of the binding site. Introducing a correction term based on balance improved the target-prediction accuracy by 27% (from 57% to 72%), and the new score was named BCGlideScore. It also improved the target-prediction accuracy by 29% (from 47% to 60%) on an external test dataset of similar quality to the Astex Diverse dataset.
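The reverse docking evaluation and the balance-based correction described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state how a ligand-protein relationship is counted as "correctly identified" (a top-1 criterion is assumed here), nor the functional form of the correction term (a linear term with a hypothetical coefficient `k` is assumed), and the scores and balance values are synthetic toy numbers, not data from the thesis. Lower (more negative) scores are treated as better, as with GlideScore.

```python
def top1_accuracy(scores):
    """Fraction of ligands whose native protein (same index) has the best
    (lowest) score. scores[i][j] = docking score of ligand i on protein j.
    The top-1 criterion is an assumption, not taken from the thesis."""
    hits = 0
    for i, row in enumerate(scores):
        best = min(range(len(row)), key=lambda j: row[j])
        if best == i:
            hits += 1
    return hits / len(scores)


def balance_corrected(scores, balance, k):
    """Hypothetical linear correction: add k * balance[j] to every score in
    protein column j. The real BCGlideScore form is not given in the abstract."""
    return [[s + k * balance[j] for j, s in enumerate(row)] for row in scores]


# Synthetic "noise-free" scores: each ligand scores best on its own protein.
true_scores = [
    [-9.0, -6.0, -5.5],
    [-6.5, -8.5, -6.0],
    [-5.0, -6.0, -8.0],
]
# Synthetic balance descriptor per protein (hydrophobic / hydrophilic ratio).
balance = [0.5, 1.0, 3.0]

# Toy interprotein noise: a per-protein bias proportional to balance.
observed = [[s - 1.5 * balance[j] for j, s in enumerate(row)]
            for row in true_scores]

print(top1_accuracy(observed))                                # biased scores
print(top1_accuracy(balance_corrected(observed, balance, 1.5)))  # corrected
```

In this toy setup the protein with the largest balance value is systematically over-scored and captures ligands that do not belong to it; removing a bias proportional to balance restores the native pairings. In practice the coefficient would have to be fitted on a training set such as the 58 Astex complexes, and the real noise need not be linear in the descriptor.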
BCGlideScore had three features associated with the improvement in target prediction: the balance-based correction term corrected the "interprotein" noise; the correction term reduced the correlation between the balance descriptor and the BCGlideScore; and the correction term might represent a rough estimate of protein entropic changes. The "extra precision" (XP) mode is another Glide docking mode, whose conformational search and scoring function are optimized for better correlation between docking score and binding affinity. Using an analysis protocol similar to that for the "standard precision" mode, we found that XPEmodelScore showed the highest accuracy in target prediction, and our data indicated that there was interprotein noise in the XPEmodelScores. Unfortunately, we were unable to identify any ligand or protein property that was strongly associated with the noise and had the potential to correct XPEmodelScores. This was likely a result of our small descriptor pool; with more descriptors to characterize the ligand and protein properties, we might be able to find a suitable property for noise correction. In this regard, interaction fingerprints may have great potential. In addition, a significantly increased correlation between docking score and binding affinity should improve prediction accuracy in both compound library screening and protein library screening. However, although our results showed that XPGlideScore correlated slightly better with binding affinity than the standard-mode GlideScore, XPGlideScore performed poorly in target prediction (only 22.0% success, compared with GlideScore's accuracy of 57%).
The above results suggest that a slightly improved correlation does not necessarily translate into improved accuracy in protein library screening. We also found that each of the docking scoring objectives (predicting the optimal binding conformation, predicting the potential protein-binding ligands, and predicting the potential targets of a ligand) emphasizes different aspects of ligand-protein binding. It may therefore be possible and more effective to develop specialized scoring functions for individual objectives. Theoretically, an omnipurpose scoring function exists, but it usually requires intensive computation to estimate. Developing specialized functions for different scoring objectives is a strategy that can reduce the precision requirement for each specialized function. Preparing more comprehensive and representative datasets to train and test more specialized scoring functions might also be easier. Therefore, separation of scoring objectives may hold the key to developing simpler yet more effective scoring systems. This is the first discussion of the discovery and correction of interprotein scoring noise in reverse docking. It is our hope that this focused discussion of the Glide scores will invite further efforts to characterize and normalize this type of interprotein noise in all docking scores, so that better target-prediction accuracy can be achieved with reverse docking. We will continue to work on developing specialized scoring functions for reverse docking.
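One way to see why the scoring objectives can diverge is to model interprotein noise as a constant offset per protein, which is an assumption made here for illustration. Such an offset leaves the ranking of ligands against any single protein unchanged (so compound library screening is unaffected) but can reorder the proteins for a given ligand (so target prediction suffers). The sketch below uses synthetic scores and hypothetical helper names; lower scores are treated as better.

```python
def rank_ligands_for_protein(scores, j):
    """Ligand indices sorted best-first (lowest score) for protein column j."""
    return sorted(range(len(scores)), key=lambda i: scores[i][j])


def rank_proteins_for_ligand(scores, i):
    """Protein indices sorted best-first (lowest score) for ligand row i."""
    return sorted(range(len(scores[i])), key=lambda j: scores[i][j])


def add_protein_offsets(scores, offsets):
    """Add a constant per-protein offset (toy interprotein noise) to each column."""
    return [[s + offsets[j] for j, s in enumerate(row)] for row in scores]


# Synthetic scores: scores[i][j] = score of ligand i docked into protein j.
scores = [
    [-9.0, -6.0, -5.5],
    [-6.5, -8.5, -6.0],
    [-5.0, -6.0, -8.0],
]
# Protein 2 is systematically over-scored by 4 units (toy value).
noisy = add_protein_offsets(scores, [0.0, 0.0, -4.0])

# Normal docking view: ligand ranking per protein is identical with and without noise.
for j in range(3):
    print(j, rank_ligands_for_protein(scores, j) == rank_ligands_for_protein(noisy, j))

# Reverse docking view: the protein ranking for ligand 0 changes.
print(rank_proteins_for_ligand(scores, 0))
print(rank_proteins_for_ligand(noisy, 0))
```

Because a per-protein offset also leaves the within-protein correlation between score and affinity untouched, a scoring function can look equally good, or slightly better, by that measure while still misranking targets.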

  • 【Online publication submitter】 Zhejiang University
  • 【Online publication issue】 2012, Issue 09