Study on Approaches of Protein Structure Prediction (蛋白质结构预测方法学研究)

【Author】 Chang Shan (常珊)

【Advisor】 Wang Cunxin (王存新)

【Author information】 Beijing University of Technology, Biomedical Engineering, 2009, PhD

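【Summary】 With the completion of the Human Genome Project, the focus of the life sciences has begun to shift toward the study of proteins, the products of gene expression, and proteomics has become a research frontier and hot topic of the post-genomic era. Protein-ligand interactions and the relationship between protein structure and function are the core topics of post-genomic research. Studying the interaction and recognition between protein receptors and ligands is important for revealing the molecular-biological mechanisms of proteins in the cell, for computer-aided drug design, and for the prediction of complex structures. Because determining the structures of protein complexes experimentally is difficult, and because computing power has grown steadily while theoretical simulation methods have developed rapidly and found wide application, computational molecular simulation has in recent years become an important means of studying how protein receptors interact with their ligands. This thesis uses molecular simulation methods, namely molecular docking and amino acid network analysis, to carry out a series of fundamental studies on the mechanisms of protein-protein interaction and recognition. The thesis covers the following main aspects:

(1) Scoring functions for protein molecular docking

Two scoring functions are proposed. The first is ComScore, a combined scoring function for Others-type protein complexes. ComScore consists of an atomic contact potential together with van der Waals and electrostatic interaction energies, with weights fitted by multiple linear regression. Tests on the docked structures of 17 complexes from the CAPRI benchmark 1.0 show that the combined score largely captures the interaction features of Others-type complexes, reflects the energy change upon complex formation, and has some ability to pick out valid structures from a large number of sampled conformations. ComScore was used in the scoring experiments of CAPRI rounds 9-12 and achieved good results in rounds 9 and 11. The second is a scoring function based on amino acid networks. The topology of protein complexes offers many insights into the mechanisms of protein-protein interaction. In this thesis, residue networks of proteins are built in which residues are the nodes and contacts between residues are the edges. According to residue type, the residue network of a protein complex is divided into two kinds, hydrophobic and hydrophilic residue networks. Analysis of the two kinds of network shows that both have small-world properties. Analysis of the network parameters shows that correctly bound complex conformations have higher interface degree values and shorter characteristic path lengths than incorrectly bound ones. These properties indicate that correctly bound complex structures have better geometric and residue-type complementarity, and that the correct binding mode plays an important role in maintaining the characteristic path length of native protein complexes. In addition, two scoring terms based on network parameters are constructed; they reflect well the overall shape and residue-type complementarity of the complex. Combining the network-based terms with other scoring terms yields a new multi-term scoring function, HPNCscore, which improves the discrimination ability of the RosettaDock combined scoring function by 12%. This work extends our understanding of protein-protein binding mechanisms and can be used in protein structure design.

(2) Improved search methods for protein molecular docking

Molecular docking must find low-energy structures in as short a time as possible, so another important problem in docking research is fast and effective search algorithms, that is, using new theories and computational methods to raise the computational efficiency of existing programs. AutoDock 3.0 is a widely used molecular docking program developed by Olson and co-workers at the Scripps Research Institute in the United States, and it has performed very well in predicting binding modes between protein receptors and ligands. Based on an analysis of the serial AutoDock 3.0 program, this thesis proposes and implements five different parallel schemes and tests and analyzes them from several angles: correctness, parameter analysis (covering five input parameters), and the influence of the number of parallel processes. In the correctness test, parallel scheme 5 and the original serial program were each applied to the docking of 10 different protein-small-molecule systems, and comparison of the docking results confirmed the correctness and reliability of the parallel program. In the parameter analysis, five input parameters were varied (the number of energy evaluations, the population size, the local search probability, the number of local search iterations, and the number of docking runs) and their effects on the different schemes were analyzed; these tests will guide the use of the parallel program in virtual screening. In the test of the number of parallel processes, the fifth, hybrid scheme combines features of several schemes, so that as the number of processes grows the program can still allocate process resources fully and sensibly and maintain high parallel efficiency. Parallelization effectively improves the computational efficiency of the AutoDock docking software and will be of some help to computer-aided drug design and virtual screening. In addition, the AutoDock 3.0 program was improved with an ant colony algorithm, which replaces the genetic algorithm used for global search in the original program. Tests on 22 protein-small-molecule systems show that the ant colony algorithm effectively improves the program's search results. Moreover, with or without local search, the ant colony algorithm performs better and converges faster than the genetic algorithm. Introducing this new optimization algorithm provides a new reference for improving molecular docking software.

(3) Amino acid networks of proteins

The three-dimensional structure of a protein molecule can be viewed as a complex network of amino acids, and analysis of the network properties helps in understanding the relationship between protein structure and function. Because the amino acid network of a protein forms during protein folding, ordinary network models have difficulty explaining the mechanism of its evolution. From the viewpoint of protein folding, an evolution model for amino acid networks is proposed. In this model, evolution starts from the amino acid sequence of a native protein and is guided by two basic assumptions, the neighbor preference rule and the energy preference rule. The neighbor preference rule is found mainly to determine the general network properties, while the energy preference rule mainly determines the specific biological structural features. Applied to native protein systems, the model reproduces the properties of amino acid networks well. In addition, an undirected network of conserved protein residues is built and studied. Identifying protein binding interfaces is an important step in predicting protein-protein interactions and in protein classification. In this thesis, the protein structure is treated as an undirected network in which conserved residues are the nodes and contacts between residues are the edges. Conserved residue networks are found to have clustering coefficients and characteristic path lengths between those of regular and random networks, and so belong to the small-world type. Residues at protein complex interfaces generally have larger degree values and lower clustering coefficients than surface residues. Spatial clustering of conserved residues is also found to be a general phenomenon. The properties of conserved residue networks can supply new parameters for protein-protein interface prediction.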

【Abstract】 With the completion of the Human Genome Project (HGP), the life sciences have turned to the study of gene expression products, i.e. proteins, and proteomics has become a frontier and a hot topic of the post-genomic era. The core of post-genomic research is the interaction between proteins and ligands and the relationship between protein structure and function. Studies of the interaction and recognition between protein receptors and ligands are very important for understanding the molecular biology of proteins in the cell, for computer-aided drug design and for predicting the structures of protein-protein complexes. It is difficult to determine the structure of a protein complex experimentally. With the continuous growth of computer processing power and the rapid development and wide application of theoretical simulation, molecular modeling methods have recently become important tools for exploring how protein receptors interact with ligands. In this thesis, a series of studies on the mechanisms of protein interaction and recognition was carried out using molecular docking and amino acid network methods. The thesis covers the following major aspects:

(1) Study on the scoring function of protein molecular docking

Two scoring functions were proposed. The first was a combinatorial scoring function, ComScore, designed specifically for Others-type protein-protein complexes. ComScore was composed of the atomic contact energy and the van der Waals and electrostatic interaction energies, with the weight of each term fitted by multiple linear regression.
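As a rough illustration of fitting term weights by multiple linear regression, the sketch below solves the least-squares normal equations for three energy terms. The abstract does not state the regression target or the data used, so the target values, the toy numbers and the function names `fit_weights` and `combined_score` are all assumptions made for this example and are not taken from ComScore itself.

```python
def combined_score(term_values, weights):
    """Weighted sum of scoring terms (hypothetical form of a combined score)."""
    return sum(w * t for w, t in zip(weights, term_values))


def fit_weights(terms, targets):
    """Least-squares weights w minimising sum((terms[i] . w - targets[i]) ** 2)."""
    n = len(terms[0])
    # Normal equations: (X^T X) w = X^T y
    a = [[sum(row[i] * row[j] for row in terms) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * y for row, y in zip(terms, targets)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        known = sum(a[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - known) / a[i][i]
    return w


# Made-up toy decoys: (atomic contact energy, van der Waals, electrostatics).
decoys = [
    (-3.0, -10.0, -2.0),
    (-1.0, -4.0, -5.0),
    (-2.5, -7.0, -1.0),
    (-0.5, -2.0, -0.5),
    (-4.0, -6.0, -3.0),
]
true_w = (0.5, 0.2, 0.3)  # assumed weights used only to generate toy targets
targets = [combined_score(row, true_w) for row in decoys]
weights = fit_weights(decoys, targets)
```

Because the toy targets are generated from known weights without noise, `weights` should reproduce `true_w` up to rounding; with real decoy data the fit would only be approximate.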
Tests on 17 Others-type complexes from CAPRI benchmark 1.0 demonstrated that the combinatorial scoring function can capture the interaction features of Others-type complexes, reflect the energy change upon complex formation, and discriminate, to a certain extent, valid structures from a large number of docked decoys. ComScore was used in the scoring experiments of CAPRI rounds 9-12, with good results in rounds 9 and 11. The second was a scoring function based on the amino acid network. Studying the topology of protein-protein complexes gives insight into the mechanisms of protein-protein interactions. In this thesis, residue networks were constructed by defining the amino acid residues as the vertices and the atom contacts between them as the edges. The residue network of a protein complex was divided into two types, the hydrophobic and the hydrophilic residue networks. Analysis of these two types shows that both have small-world properties. Furthermore, analysis of the network parameters shows that correctly bound complex conformations have both a higher sum of interface degree values and a shorter characteristic path length than incorrectly bound ones. These features indicate that correctly bound conformations have better geometric and residue-type complementarity, and that the correct binding mode is very important for preserving the characteristic path length of native protein complexes. In addition, two scoring terms are proposed based on the network parameters; they take into account the overall shape of the complex and its residue-type complementarity. These network-based terms were combined with other scoring terms to give a new multi-term scoring function, HPNCscore. It improves the discrimination of the combined scoring function of RosettaDock by more than 12%.
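The two network parameters mentioned above can be sketched on a toy residue network. The contact list, the residue labels and the function names are hypothetical, and the definition of the interface degree sum used here (each cross-chain contact counted once per endpoint) is an assumption, since the abstract does not define it.

```python
from collections import deque


def build_network(contacts):
    """Undirected residue network: residues are nodes, contacts are edges."""
    adj = {}
    for u, v in contacts:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def interface_degree_sum(adj, chain_of):
    """Assumed definition: number of cross-chain contacts summed over all residues."""
    return sum(1 for u in adj for v in adj[u] if chain_of[u] != chain_of[v])


# Toy complex: chain A (A1-A2-A3) and chain B (B1-B2-B3).
backbone = [("A1", "A2"), ("A2", "A3"), ("B1", "B2"), ("B2", "B3")]
loose = build_network(backbone + [("A3", "B1")])                # one interface contact
tight = build_network(backbone + [("A3", "B1"), ("A2", "B2")])  # two interface contacts
chain_of = {name: name[0] for name in loose}

loose_length = characteristic_path_length(loose)
tight_length = characteristic_path_length(tight)
loose_interface = interface_degree_sum(loose, chain_of)
tight_interface = interface_degree_sum(tight, chain_of)
```

In this toy case the conformation with more interface contacts has the larger interface degree sum and the shorter characteristic path length, which is the direction of the trend reported for correctly bound conformations.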
This work can enhance our knowledge of the mechanisms of protein-protein interaction and recognition and can also be used in protein design.

(2) Improvement on the searching algorithm of protein molecular docking

Molecular docking needs to find low-energy structures in as little time as possible. Another important issue in molecular docking is therefore the efficiency of the search algorithm, that is, using new theories and computational methods to improve the efficiency of existing docking programs. AutoDock 3.0 is a widely used docking program developed by Professor Olson's group at the Scripps Research Institute, and it has achieved great success in predicting the binding modes and conformations of protein receptors and ligands. In this thesis, based on an analysis of the AutoDock 3.0 algorithm, we proposed five parallel schemes using the Message Passing Interface (MPI) library. We tested and analyzed these schemes in terms of reliability, input parameters (five in total) and the influence of the number of processors. In the reliability test, parallel scheme 5 and the serial program were applied to 10 protein-ligand systems, and the docking results confirmed the validity and reliability of the parallel program. In the parameter analysis, we varied five parameters: the number of energy evaluations, the population size, the frequency of local search, the number of local search iterations and the number of docking runs. Their influence on the five schemes was analyzed, which will guide the use of the parallel program in virtual screening. In the test of the number of processors, the hybrid parallel scheme 5 combines the characteristics of the other schemes. As the number of processors increases, the hybrid scheme can still use the processors effectively and maintain high speedup and parallel efficiency.
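The five parallel schemes are not described in this abstract, so the sketch below is not any of them. It only illustrates, under that caveat, one simple way to spread independent docking runs over processes (round-robin) and the standard definitions of speedup and parallel efficiency. The function names and the timings in the example are hypothetical.

```python
def assign_runs(n_runs, n_procs):
    """Round-robin assignment of docking-run indices to process ranks."""
    return [list(range(rank, n_runs, n_procs)) for rank in range(n_procs)]


def speedup_and_efficiency(t_serial, t_parallel, n_procs):
    """Speedup S = T_serial / T_parallel and parallel efficiency E = S / n_procs."""
    speedup = t_serial / t_parallel
    return speedup, speedup / n_procs


# Ten docking runs spread over four processes.
plan = assign_runs(10, 4)

# Made-up wall-clock times in seconds, used only to show the arithmetic.
speedup, efficiency = speedup_and_efficiency(100.0, 30.0, 4)
```

With ten runs on four processes, two ranks receive three runs and two receive two, so the load is uneven. Imbalance of this kind is one reason efficiency can fall as the number of processes grows.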
Parallelization can enhance the efficiency of the molecular docking program AutoDock 3.0, which can be of help in computer-aided drug design and virtual screening. In addition, we improved AutoDock 3.0 by replacing its global search method, the Genetic Algorithm (GA), with the Ant Colony Optimization (ACO) method. Tests on 22 protein-ligand systems show that the ACO method improves the search results. Moreover, with or without local search, ACO performs clearly better than GA and its energy converges faster. Introducing the ACO method as a new search technique offers a reference for the improvement of docking programs.

(3) Study on the amino acid networks of proteins

The three-dimensional structure of a protein can be treated as a complex network composed of amino acids, and the network properties can help us understand the relationship between protein structure and function. Since the amino acid network of a protein is formed during protein folding, general network models have difficulty explaining how it evolves. From the perspective of protein folding, we proposed an evolving model for amino acid networks. In our model, the evolution starts from the amino acid sequence of a native protein and is guided by two generic assumptions, the neighbor preferential rule and the energy preferential rule. It is found that the neighbor preferential rule mainly determines the general network properties, while the energy preferential rule mainly determines the specific biological structural characteristics. Applied to native proteins, our model mimics the features of amino acid networks well. In addition, the conserved residue network was constructed and studied.
Identifying protein interfaces is crucial for the prediction of protein-protein interactions and for protein functional classification. In this thesis, the protein structure was modeled as an undirected graph with the conserved amino acid residues as the vertices and the atom contacts between them as the edges. It is found that conserved residue networks are characterized by intermediate values of clustering coefficient and characteristic path length, which is typical of small-world networks. Residues at protein interfaces typically have higher degree values and lower clustering coefficients than surface residues. Additionally, the spatial clustering of conserved residues is found to be a general phenomenon. These results indicate that the properties of conserved residue networks can provide new parameters for protein-protein interface prediction.
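The degree and clustering coefficient used to compare interface and surface residues can be computed per node as sketched below. The toy contact list and the function names are hypothetical. The clustering coefficient follows the usual local definition, C_i = 2 e_i / (k_i (k_i - 1)), where k_i is the degree of node i and e_i is the number of edges among its neighbours.

```python
from itertools import combinations


def build_network(contacts):
    """Undirected graph: residues are vertices, contacts are edges."""
    adj = {}
    for u, v in contacts:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, node):
    """Number of residues in contact with the given residue."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves in contact."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbours; reported as 0 here
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Toy network: a triangle a-b-c with one extra residue d attached to a.
network = build_network([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d")])
degree_a = degree(network, "a")
clustering_a = clustering_coefficient(network, "a")
clustering_b = clustering_coefficient(network, "b")
clustering_d = clustering_coefficient(network, "d")
```

Residue a has the highest degree but a lower clustering coefficient than b, the same combination the abstract reports for interface residues relative to surface residues.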
