蛋白质结构预测模型研究
Study on Models of Protein Structure Prediction
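【Author】 Luo Liang (罗亮);
【Advisor】 Xu Jin (许进);
【Author information】 Huazhong University of Science and Technology (华中科技大学), Systems Analysis and Integration, 2010, PhD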
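【Abstract, translated from the Chinese】 Over the past 20 years, the exponential growth of biological data has given rise to a new interdisciplinary field, bioinformatics. Protein structure and function prediction is a core research topic in bioinformatics: it not only helps explain the mechanism of protein folding but also provides important guidance for experimental biology. The key to protein structure prediction is to build an effective prediction model and to give a reasonable, fast prediction algorithm. However, the spatial structure of proteins is complex and the causes of the various structures are not fully understood, so current prediction models and algorithms each have their own limitations, and the accuracy of a model and the computational complexity of solving it constrain each other. This dissertation studies these problems in depth and proposes and improves several protein structure prediction models and methods.

Graph theory plays an important role in research on protein structure prediction. This dissertation converts the protein secondary structure prediction problem into a shortest-path problem on a graph: every 3 vertices represent the secondary structures that one amino acid residue in the sequence may form, edges represent possible connections between residues, and a function is designed to assign weights to the edges, so that the shortest path in this weighted graph corresponds to the secondary structure of the protein. The method was applied to several test sets with good results, and the selection of the environment parameters in the model is discussed.

Redundancy in protein sequence data is a problem to be avoided when training protein structure prediction models. This dissertation introduces the graph-theoretic concept of the maximum clique into the redundancy-removal algorithm, uses mature maximum-clique algorithms to improve the handling of redundancy in protein data, and processes several protein data sets with fairly good results.

DNA computing is a completely new computational model. This dissertation attempts to introduce DNA computing into protein structure prediction and builds a plasmid DNA computing model for it, offering a new line of research. The model first converts a side-chain or main-chain segment whose spatial conformation is to be determined into a vertex of a weighted graph, with vertices and edges weighted according to predefined criteria. It then combines this graph with the plasmid DNA computing model for the maximum weight clique problem to obtain a DNA computing model for the protein prediction problem. Finally, the encoding of the plasmid DNA computing model is studied and an encoding tool is given.

Probabilistic graphical models are an effective class of models for protein structure prediction. This dissertation classifies the 20 amino acids and, by collecting statistics on the typical formation patterns of β-sheets, extends the 3-state hidden Markov prediction model to 9 states, which effectively improves the prediction accuracy for β-sheets. Conditional random fields are a recently proposed probabilistic graphical model. This dissertation constructs a protein structure prediction model based on conditional random fields and gives training and decoding algorithms for this class of conditional random fields. In addition, the multiple sequence alignment program PSI-BLAST is used to convert protein sequences into sequence motifs that represent evolutionary information, in order to improve prediction accuracy. Finally, the prediction results are given and compared.

An important problem in protein structure prediction is the correct prediction of disulfide bond connectivity. Accurate prediction of disulfide bonds reduces the search space of protein conformations and benefits the prediction of 3D structure. This dissertation successfully introduces the LVQ neural network method into disulfide bond prediction. The results show that disulfide connectivity is closely related to the local sequence patterns around cysteines, and that the disulfide connectivity pattern of a protein can be predicted from its primary sequence. Applying this method to disulfide bond prediction gave good results.

The HP model is a simplified model for protein structure prediction. This dissertation improves the HP model: according to their hydrophobic or hydrophilic character and their physicochemical properties, amino acid residues are divided into 4 classes, a protein sequence is reduced to a sequence over a 4-letter alphabet, and a simplified model is given that predicts the spatial structure of the protein from the lowest-energy structure of this 4-letter sequence. Finally, an improved simulated annealing algorithm is used to predict the two-dimensional structures of 4 proteins of different lengths. It obtains lower-energy conformations than earlier HP-model results, which indicates that the simplified model is more accurate than the HP model. The method can also be applied to three-dimensional protein structure prediction.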
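
The shortest-path formulation of secondary structure prediction described in the abstract can be illustrated with a minimal Python sketch. It assumes three states per residue (H, E, C) and a layered graph in which every vertex of one residue connects to every vertex of the next. The function names, the edge-weight function and the toy preference table in it are hypothetical placeholders chosen for illustration; the abstract does not give the dissertation's actual weighting function.

```python
# Illustrative sketch only: shortest path through a layered graph with
# three vertices (H, E, C) per residue. The weights are hypothetical.
STATES = "HEC"  # helix, strand, coil

# Toy preference table (an assumption for this sketch, not data from the thesis).
PREFERS = {"A": "H", "L": "H", "E": "H",
           "V": "E", "I": "E", "Y": "E",
           "G": "C", "P": "C"}


def edge_weight(prev_state, state, residue):
    """Hypothetical edge weight: 0 if the residue 'prefers' the target state,
    1 otherwise, plus 0.5 whenever the state changes along the edge."""
    cost = 0.0 if PREFERS.get(residue, "C") == state else 1.0
    if prev_state is not None and prev_state != state:
        cost += 0.5
    return cost


def shortest_path_labels(sequence):
    """Return the state string on the cheapest path through the layered graph."""
    if not sequence:
        return ""
    # dist[s]: cost of the cheapest path that ends in state s at the current residue
    dist = {s: edge_weight(None, s, sequence[0]) for s in STATES}
    back = []  # one dict of back-pointers per residue after the first
    for residue in sequence[1:]:
        new_dist, pointers = {}, {}
        for s in STATES:
            best_prev = min(STATES,
                            key=lambda p: dist[p] + edge_weight(p, s, residue))
            new_dist[s] = dist[best_prev] + edge_weight(best_prev, s, residue)
            pointers[s] = best_prev
        dist = new_dist
        back.append(pointers)
    # Trace the cheapest path backwards from the best final state.
    state = min(STATES, key=dist.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(shortest_path_labels("AAAVVV"))
```

Because the graph is layered and acyclic, a single left-to-right dynamic-programming pass finds the shortest path, so a general-purpose algorithm such as Dijkstra's is not needed in this sketch.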
【Abstract】 The exponential growth of biological data has given rise to a new multidisciplinary research area, bioinformatics. One of the major research issues in bioinformatics is the prediction of protein structure from protein sequence. This interdisciplinary field draws on mathematics, computer science, information science, physics, systems science, and management science as well as biology. To address the problem of protein structure prediction, this dissertation presents several new and improved models.

Graph theory plays a key role in protein structure prediction. This dissertation proposes a method based on the shortest path of a graph. Each residue is represented by three vertices, one for each secondary structure it may adopt, and each edge of the graph is assigned a weight by a function. The shortest path in this graph corresponds to the predicted secondary structure. Several groups of proteins were tested with this method, and the results show that it is feasible. Finally, the selection of parameters is discussed.

DNA computing is a new computing model. This dissertation introduces DNA computing into protein structure prediction. Each possible conformation of a residue in an amino acid sequence is represented as a node in a graph. Each node is given a weight based on the degree of interaction between its side-chain atoms and the local main-chain atoms. The protein structure prediction problem is mapped to finding the maximal sets of completely connected nodes (cliques) in a graph, which a DNA computing model can then find.

Probabilistic graphical models are effective models for protein structure prediction. By introducing a hidden state variable, a hidden conditional random field (HCRF) is built and applied to protein structure prediction. A method for constructing the model is given, together with algorithms for training and decoding it, and the model is used to predict the secondary structure of a well-known protein dataset (CB513). Finally, the results are compared with those of other methods.

An important problem in protein structure prediction is the correct location of disulfide bonds in proteins. Knowing where the disulfide bonds are can strongly reduce the search in the conformational space of a protein, so correctly predicting disulfide bonding from the residue sequence may also help in predicting the 3D structure. In this dissertation the LVQ (learning vector quantization) artificial neural network method is applied to predict the disulfide bonding of proteins. The local sequence arrangement around cysteine is of great significance to disulfide bonding, so disulfide bonding can be predicted from the primary structure. The method was used to predict disulfide bonding in protein structures, and good results were obtained.

The HP model is a simplified model for protein structure prediction. Here the 20 kinds of amino acid residues are classified into four groups, so that a protein sequence is converted into a new sequence over a four-letter alphabet. A protein structure prediction model is then constructed by searching for the lowest-energy structure of the new sequence. A simulated annealing algorithm is used with this model, and it reaches lower energies than the HP model. The model can be extended to predict protein structure in 3D.
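
As background for the lattice model in the last paragraph of the abstract, the following minimal Python sketch evaluates the energy of a two-dimensional fold in the standard two-letter HP model, in which each pair of H residues that are lattice neighbours but not chain neighbours contributes -1. This is the baseline model, not the dissertation's four-class variant, whose classes and energy table are not given in the abstract. The function names and the move encoding (U, D, L, R) are assumptions made for this sketch.

```python
# Illustrative sketch only: energy of a 2D square-lattice fold in the
# standard HP model. A search such as simulated annealing would minimise it.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_to_coords(moves):
    """Convert a move string into lattice coordinates starting at (0, 0).
    Return None if the walk visits a site twice (not self-avoiding)."""
    x, y = 0, 0
    coords = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        if (x, y) in coords:
            return None
        coords.append((x, y))
    return coords


def hp_energy(sequence, moves):
    """Return the HP energy of the fold, or None if the fold is invalid."""
    if len(moves) != len(sequence) - 1:
        raise ValueError("need exactly one move per bond")
    coords = fold_to_coords(moves)
    if coords is None:
        return None
    index_at = {site: i for i, site in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only right and up counts each neighbouring pair once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy


print(hp_energy("HPPH", "RUL"))
```

A four-letter variant would presumably replace the single H-H contact rule with a lookup table indexed by pairs of residue classes, but the values of such a table would have to come from the dissertation itself.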
【Key words】 protein structure prediction; the shortest path; DNA computing; maximal clique; conditional random fields; disulfide bonding;