蛋白质关联图预测研究

Research on Prediction of Protein Contact Maps

【Author】 Liu Guixia (刘桂霞)

【Advisor】 Zhou Chunguang (周春光)

【Author information】 Jilin University, Computer Application Technology, 2007, PhD

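【Abstract (Chinese version, translated)】 A fundamental and still unsolved problem in bioinformatics is predicting the three-dimensional structure of a protein from its amino acid sequence. Obtaining the three-dimensional structure by full-molecule modeling remains very difficult, so an intermediate step, predicting the contact map between protein residues, has emerged and developed rapidly. A protein contact map carries important information about protein folding and spatial structure, so solving this problem matters greatly for protein fold recognition. Once the contact map is known, reconstructing the tertiary structure becomes comparatively simple, and methods for reconstructing tertiary structure from contact map information are maturing. In addition, among protein structure comparison methods, contact map overlap is the only one that does not require the protein structure to be computed in advance. Because protein structure determines function, solving the contact map prediction problem is of great significance for both spatial structure prediction and function prediction. Computational intelligence is a bio-inspired computing approach that simulates and studies intelligent behavior from the lowest biological level. It extends traditional computing models, has an outstanding ability to reason and learn in uncertain and imprecise environments, and is an effective computational tool for building intelligent systems. With the Human Genome Project and the completion of further genome sequencing projects, computational intelligence has been widely applied in bioinformatics. Building on a thorough analysis of the state of research, research hotspots, and development trends in protein structure prediction, this dissertation focuses on applying artificial neural networks and artificial immune systems to protein contact map prediction. The main contributions and results are as follows: (1) A comprehensive review of the background, current state, significance, and related concepts of protein contact map prediction methods. (2) A review of bioinformatics in the post-genome era and of protein structure and the principles and methods of its prediction, an exposition of the relevant theory of computational intelligence, an introduction to the basic theory of artificial neural networks and artificial immune systems, and a comprehensive summary of the applications of computational intelligence in bioinformatics. (3) A method for protein contact map prediction based on a recurrent neural network with bias (deviation) units. (4) A method for protein contact map prediction based on a transiently chaotic neural network. (5) A protein contact map prediction model based on the clonal selection algorithm. These results enrich applied research on computational intelligence theory, have theoretical significance and practical value for research on recurrent neural networks, chaotic neural networks, and clonal selection algorithms, and provide meaningful methods and means for research on protein contact map prediction.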

【Abstract】 Computational intelligence (CI) is a nature-inspired computing methodology that simulates and studies intelligent behavior from the lowest biological level. CI extends the traditional style of computation and provides a new approach to solving complex problems. It can reason and learn in uncertain and imprecise environments, which makes it a powerful computational tool for building more intelligent systems. Its main methods include fuzzy systems, artificial neural networks, genetic algorithms, and artificial immune systems. CI has advanced rapidly in recent years and has found applications in many fields, such as pattern recognition, machine learning, knowledge discovery, and data mining. One major use is in a newly evolved branch of science: bioinformatics. With the accomplishment of the Human Genome Project (HGP) and the completion of further genomes, CI will play a bigger role in computational biology and bioinformatics. CI has been used to analyze genomic sequences, protein structures and folds, and gene expression data. It has also been used for fast sequence comparison and database search, automated gene identification, efficient modelling and storage of heterogeneous data, and similar tasks. Since the work entails processing huge amounts of incomplete or ambiguous biological data, the learning ability of neural networks, the uncertainty-handling capacity of fuzzy sets, and the search potential of genetic algorithms are used synergistically.
Computational intelligence opens several possibilities in bioinformatics, particularly by generating low-cost, low-precision, good solutions. Proteins, the macromolecules encoded by DNA whose chemical unit is the amino acid, are of great importance to human biological activity. Amino acids join into a continuous long chain with a spatial structure, and a protein comes into being. Proteins are the basic components of life and are responsible for carrying out the functions of the body's cells. Genome sequencing results indicate that the human body contains about one hundred thousand different kinds of proteins, each with a unique function and purpose, and that a protein carries out its function through the interaction between its structure and other molecules. Knowing the structure of a protein is therefore the key to understanding its function. It is no exaggeration to say that protein structure prediction is one of the major research areas of bioinformatics in the twenty-first century. In the post-genome era, the sharp increase in biological information demands batch processing by computer, which led to the birth of bioinformatics. Currently, the main research areas of bioinformatics are gene regulation and the study of protein structure and function, and protein structure prediction is the preliminary step of the latter. Within it, secondary structure prediction has matured, whereas 3D structure prediction is still at an early stage and needs further investigation. Present protein structure prediction methods can be roughly classified into ab initio prediction based on the minimum-energy principle and methods that learn from information on related proteins.
Each of them has its strengths and shortcomings. Energy minimization is more adaptable and highly independent, but the energy function is hard to formulate. Even with a comparatively precise energy function, the large computational scale caused by numerous parameters and the tiny energy differences between conformations, only on the order of 1 kcal/mol, make prediction difficult. Prediction from related-protein information is more precise, especially for homologous proteins, but it is tightly restricted by the database of known protein structures and is less general. The accuracy of methods that predict three-dimensional structure directly from the amino acid sequence is not yet high enough, so intermediate steps such as residue contact prediction and residue spatial distance prediction have been put forward and have developed rapidly. Contacts between protein residues constrain protein folding and characterize different protein structures, so solving contact prediction may be very useful in protein fold recognition and de novo design. It is much easier to obtain the major features of the three-dimensional (3D) structure of a protein if the residue contacts are known for the protein sequence, and methods that reconstruct the protein structure from its contact map have been developed. Similarity based on contact map overlap is the only approach to structural comparison that does not require a pre-calculated set of residue equivalences, since determining those equivalences is itself one of the goals of the method. A variety of measures of residue contact are used in the literature. Some use the distance between Cα atoms, while others prefer the distance between Cβ atoms. Contact maps are two-dimensional, binary representations of protein structures.
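As a minimal sketch of such a binary representation, assuming the Cα coordinates are already available as (x, y, z) tuples in angstroms: the function name `contact_map` and its parameters are illustrative and not taken from the dissertation, while the defaults (an 8 Å threshold and a minimum sequence separation of 7) follow the contact definition this abstract adopts.

```python
import math


def contact_map(ca_coords, threshold=8.0, min_separation=7):
    """Return a symmetric N x N list of 0/1 contact flags.

    ca_coords      -- list of (x, y, z) C-alpha coordinates, one per residue
    threshold      -- distance cutoff in angstroms (assumed default: 8.0)
    min_separation -- minimum sequence separation |k - l| (assumed default: 7)
    """
    n = len(ca_coords)
    cmap = [[0] * n for _ in range(n)]
    for k in range(n):
        # Only pairs at least `min_separation` apart in sequence are considered.
        for l in range(k + min_separation, n):
            if math.dist(ca_coords[k], ca_coords[l]) < threshold:
                cmap[k][l] = cmap[l][k] = 1
    return cmap


if __name__ == "__main__":
    # Hypothetical toy chain: ten residues spaced 1 angstrom apart on a line.
    toy_chain = [(float(i), 0.0, 0.0) for i in range(10)]
    for row in contact_map(toy_chain):
        print("".join(str(flag) for flag in row))
```

The nested loop is quadratic in the number of residues, which is acceptable for single chains; a spatial index would be the usual replacement for very large structures.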
For a protein with N residues, the contact map assigns to each pair of amino acids k and l (1 ≤ k, l ≤ N) the value C(k,l) = 1 if a suitably defined distance d(k,l) < d_thr, where d_thr is a user-defined threshold distance between the amino acids, and C(k,l) = 0 otherwise. We consider two residues to be in contact if the distance between their Cα atoms is less than 8 Å and their sequence separation is not smaller than 7. Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure. Although improvements have been made in recent years, accurate prediction of residue contact maps is still far from being achieved, and the limitations of existing prediction methods emerged again at CASP6 and in the automatic evaluation of structure prediction servers such as EVA. Based on an analysis of the current state of research, research focuses, and development trends in protein structure prediction, this dissertation focuses on the application of artificial neural networks and artificial immune systems to the prediction of protein contact maps. The main contributions of this dissertation are summarized as follows: (1) The dissertation surveys protein structure prediction and the prediction of protein contact maps, including their background, the state of research, and their significance. (2) The relevant computational intelligence theories are expounded, including the Deviation Units Recurrent Neural Network, the Transiently Chaotic Neural Network, the Artificial Immune System, and the Clonal Selection Algorithm. The dissertation also surveys computational intelligence in bioinformatics, including protein structure prediction, prediction of protein contact maps, multiple sequence comparison, and gene expression data. (3) To deal with the slow learning speed of the BP neural network, a Deviation Units Recurrent Neural Network model is presented based on the Jordan and Elman
neural networks. The weight-adjustment method is developed from the BP algorithm. Simulations on fault diagnosis are performed with this neural network model. Experimental results show that this network model converges faster than the traditional BP network and is practical. In this dissertation, we capture two features of the amino acids: predicted secondary structure and hydrophobicity. The predicted secondary structures for each protein are obtained by using DSSP. We use 3 neurons to denote the 6 possible secondary structure pairs, since an amino acid residue has three possible secondary structures: α-helix, β-sheet, and coil. Hydrophobicity is a measure of the nonpolarity of the side chains. As the nonpolarity (hydrophobicity) of a side chain increases, it avoids contact with water and is buried within the nonpolar core of the protein. This is seen as the essential driving force in protein folding. This quantity is used to encode residue-specific information for the network. Since the hydrophobicity of a residue affects non-covalent bonding with its surroundings, it can be a contributing factor in deciding whether that residue is in contact with others. A major characteristic of the neural network in this thesis is that it has ten conjunction units, which take into account the influence of neighboring pairings and global sequence correlation. Another important characteristic is that the network uses a novel binary input encoding. The method assigns protein contacts with an average accuracy of 0.26 and an improvement over a random predictor by a factor greater than 8. (4) An algorithm based on a chaotic neural network is proposed to solve the protein contact map problem. The proposed network has merits such as transient chaos and stable convergence, which overcome the tendency of conventional Hopfield neural networks to get stuck in local minima.
It reaches a stable convergent state after a short period of reversed bifurcations. Numerical simulation of the protein contact map problem shows that the TCNN has a greater ability to find globally optimal or near-optimal solutions, and searches more efficiently, than the HNN. The method assigns protein contacts with an average accuracy of 0.27 and an improvement over a random predictor by a factor greater than 9. (5) This dissertation proposes a protein contact map prediction method employing protein folding rules and the clonal selection algorithm. By deriving independent constraint rules from the characteristics of contact maps, it removes the limitation imposed by the present protein structure database, and it achieves satisfactory precision. The immune algorithm is an emerging algorithm that simulates the biological immune system by computer. One kind of immune algorithm, the clonal selection algorithm, is widely used for its adaptability, implicit parallelism, and diversity. The clonal selection algorithm is derived by simulating the antibody production model. In the immune system, each antibody is cloned at a rate based on its affinity to the invading antigen and then mutates at high frequency to generate a better-adapted antibody, which finally leads to the optimum solution. Thus the fitness in the clonal selection algorithm expresses this affinity between antibody and antigen.
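The clone-then-mutate loop described above can be sketched on a toy problem. This is an illustrative CLONALG-style search over bit strings and not the dissertation's implementation: the function name `clonal_selection`, every parameter, the rank-based clone counts, and the rank-based mutation rates are assumptions chosen for brevity, and the bit-matching affinity in the usage lines is a hypothetical stand-in for a real fitness function.

```python
import random


def clonal_selection(affinity, n_bits, pop_size=20, n_select=5,
                     clone_factor=2, generations=60, seed=0):
    """Toy clonal selection over bit strings (illustrative only).

    affinity -- callable mapping a list of 0/1 bits to a number (higher is better)
    Returns (best_antibody, history), where history holds the best affinity
    found after each generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=affinity, reverse=True)
        clones = []
        for rank, antibody in enumerate(population[:n_select]):
            # Higher-affinity antibodies (lower rank) receive more clones.
            n_clones = clone_factor * (n_select - rank)
            # Lower-affinity antibodies mutate at a higher per-bit rate.
            rate = (rank + 1) / n_bits
            for _ in range(n_clones):
                mutant = [bit ^ 1 if rng.random() < rate else bit
                          for bit in antibody]
                clones.append(mutant)
        # Elitist replacement: keep the best of parents and mutated clones.
        population = sorted(population + clones, key=affinity,
                            reverse=True)[:pop_size]
        history.append(affinity(population[0]))
    return population[0], history


if __name__ == "__main__":
    # Hypothetical "antigen": affinity counts the bits that match it.
    antigen = [1, 0] * 8
    match = lambda bits: sum(a == b for a, b in zip(bits, antigen))
    best, history = clonal_selection(match, n_bits=len(antigen))
    print(match(best), "of", len(antigen), "bits matched")
```

Because parents compete with their clones for survival, the best affinity never decreases from one generation to the next; dropping that elitism would trade stability for diversity.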
A fitness function is constructed in this dissertation from protein folding restrictions, such as the amino acid hydrophobicity rule, the secondary structure folding rules of proteins, the number of contacts in the contact map, the degree of each vertex, and other special rules. An intermediate solution generated by the clonal selection algorithm is penalized according to these restrictions: the more it breaks the rules, the less feasible it is in the real world and the larger the penalty it receives, so it has a higher probability of mutation in order to produce, in the next iteration, a new solution more consistent with the biological characteristics of proteins, which in effect optimizes the prediction. Tests on 200 non-homologous proteins in 5 groups of different lengths show that the algorithm has good adaptability and high efficiency: the average precision and coverage of each group are higher than 40% and 35% respectively, and the precision and coverage differences between groups are less than 4%. Although the test results differ considerably across thresholds from 6 to 10 angstroms, their mean precision is still greater than 35%. Meanwhile, the execution time of a contact map prediction is no more than 2 minutes, with a mean of about 100 seconds. This dissertation is supported by the National Natural Science Foundation of China project "Research on relevant combinatorial theory and algorithm in bioinformatics" (No.
60433020), the Science & Technology Development Project of Jilin Province "Research on prediction of protein structure and function based on evolving algorithm" (No. 20020608), and the Innovative Foundation Project of Jilin University "Research on the method protein structure prediction" (No. 450011022211). The achievements of this dissertation advance applied research on computational intelligence theory, are of significance in the fields of recurrent neural networks, transiently chaotic neural networks, and clonal selection algorithms, and provide effective methods and means for research on protein contact map prediction.
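The abstract quotes accuracy, coverage, and improvement-over-random figures without defining them. The following is a minimal sketch under assumed definitions that are common in the contact-prediction literature and may differ from those used in the dissertation: accuracy (precision) is the fraction of predicted contacts that are real, coverage is the fraction of real contacts that were predicted, and improvement over random is accuracy divided by the fraction of all candidate residue pairs that are real contacts. The function name `evaluate_contacts` and its arguments are hypothetical.

```python
def evaluate_contacts(predicted, observed, n_possible):
    """Score a contact prediction (assumed definitions, see lead-in).

    predicted  -- iterable of (k, l) residue pairs predicted to be in contact
    observed   -- iterable of (k, l) residue pairs actually in contact
    n_possible -- number of candidate pairs (for example, all pairs that
                  satisfy the minimum sequence separation)
    Returns (precision, coverage, improvement_over_random).
    """
    predicted, observed = set(predicted), set(observed)
    correct = len(predicted & observed)
    precision = correct / len(predicted) if predicted else 0.0
    coverage = correct / len(observed) if observed else 0.0
    # A random predictor is right with probability (real contacts / candidates).
    random_accuracy = len(observed) / n_possible if n_possible else 0.0
    improvement = precision / random_accuracy if random_accuracy else 0.0
    return precision, coverage, improvement


if __name__ == "__main__":
    # Hypothetical toy data: 4 predicted pairs, 3 real contacts, 30 candidates.
    pred = [(0, 7), (1, 8), (2, 9), (3, 10)]
    real = [(0, 7), (1, 8), (4, 11)]
    print(evaluate_contacts(pred, real, n_possible=30))
```

Pairs are expected in a consistent order, such as k < l, so that the same contact is not counted under two different tuples.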

  • 【Online publication contributor】 Jilin University
  • 【Online publication issue】 2007, Issue 03