蛋白质二级结构的预测以及二级结构与三级结构之间关联的探讨

A Study on the Protein Secondary Structure Prediction and the Connection between Protein Secondary Structure and Its 3D Structure

【Author】 Feng Yong'e (冯永娥)

【Supervisor】 Luo Liaofu (罗辽复)

【Author information】 Inner Mongolia University, Biophysics, 2008, PhD

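【Abstract (translated from Chinese)】 The biological function of a protein is based on its structure. With the successful completion of the Human Genome Project, protein sequence information has accumulated far faster than the number of known protein structures has grown. The main experimental techniques for studying protein structure are X-ray crystallography, nuclear magnetic resonance (NMR), and electron microscopy. However, determining protein structure experimentally is costly and time-consuming, and it still faces technical difficulties that cannot currently be solved. There is therefore strong interest in theoretical and computational methods that predict protein structure directly from sequence information, which is one of the important topics in bioinformatics.

At present, predicting protein tertiary structure directly from the amino acid sequence remains very difficult, and more attention has focused on predicting protein secondary structure. Because secondary structure elements are the basic units with which the polypeptide chain folds in three-dimensional space, secondary structure prediction is usually taken as the first step of spatial structure prediction. It is an important intermediate step in tertiary structure prediction and also an important challenge in the theoretical study of protein folding.

This dissertation presents a new method, the increment of diversity with quadratic discriminant analysis based on tetra-peptide structural words (the TPIDQD algorithm for short), and uses it to predict secondary structure on two datasets of different sizes. It also studies the connection between secondary structure and tertiary structure on a standard set of 325 samples.

(1) The new prediction algorithm has three main steps. First, the frequencies with which the three defined kinds of tetra-peptide structural words (alpha, beta, coil) occur in the sequence are used as the source of diversity, from which the standard sources are built. Second, the increment of diversity combined with quadratic discriminant analysis is used to predict the secondary structure of the central residue of any sequence fragment. Finally, the prediction is post-processed by removing structure fluctuations and by correcting the predicted structure boundaries with tetra-peptide boundary words.

(2) The TPIDQD algorithm is applied for the first time to secondary structure prediction on the CB513 dataset, where the accuracy Q3 reaches 79.19% in three-fold cross-validation.

(3) A new dataset of 1645 non-redundant protein chains is constructed, with structure resolution better than 3 Angstroms and sequence similarity below 25%. Using the TPIDQD algorithm to predict the structural state of the central residue of 21-residue fragments in this dataset gives a Q3 of 79.68% in ten-fold cross-validation. The result improves further when long-range sequence information is considered, that is, when fragments longer than 21 residues are used. In addition, as the word library is enlarged, training on CB513 and testing on the 1645-protein dataset also reaches an accuracy of 79%.

(4) The connection between secondary structure and tertiary structure is studied for 325 proteins. Using generalized secondary-structure sequence information, we define a distance between two proteins and analyze its correlation with the tertiary-structure distance between the two proteins, expressed as a similarity score. After the dependence on length is removed, 300 of the correlation coefficients exceed the threshold at the significance levels α = 0.05 and α = 0.01.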
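
The increment-of-diversity step in (1) can be illustrated with a minimal Python sketch. The abstract does not state the formula, so the sketch assumes the definition commonly used in the increment-of-diversity literature: D(X) = N ln N - Σ n_i ln n_i for a source X with counts n_i and total N, and ID(X, S) = D(X + S) - D(X) - D(S). The natural logarithm, the function names, the toy standard sources, and the example fragment are all assumptions made for illustration and are not taken from the dissertation.

```python
import math
from collections import Counter


def tetrapeptide_counts(fragment: str) -> Counter:
    """Count overlapping 4-residue words in a sequence fragment."""
    return Counter(fragment[i:i + 4] for i in range(len(fragment) - 3))


def diversity(counts) -> float:
    """D(X) = N ln N - sum_i n_i ln n_i, with N = sum_i n_i (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(x, s) -> float:
    """ID(X, S) = D(X + S) - D(X) - D(S); smaller means X resembles S more."""
    mixed = Counter(x)
    mixed.update(s)  # Counter.update adds counts word by word
    return diversity(mixed) - diversity(x) - diversity(s)


# Hypothetical toy standard sources; real ones would be built from training data.
standard_sources = {
    "alpha": Counter({"AEEL": 5, "EELL": 4, "LLKK": 3}),
    "beta": Counter({"VTVY": 5, "TVYV": 4, "YVVK": 3}),
    "coil": Counter({"GSPG": 5, "PGDG": 4, "NGSP": 3}),
}
fragment = "AEELLKKAEELLKKAEELLKK"  # 21 residues, made up for illustration
x = tetrapeptide_counts(fragment)
scores = {state: increment_of_diversity(x, s) for state, s in standard_sources.items()}
```

Under this definition the increment is zero when two sources have proportional counts and grows as they share fewer words. In the method described above, such values would presumably serve as inputs to the quadratic discriminant rather than as the final decision; the abstract does not spell out that step, so it is left out of the sketch.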

【Abstract】 Knowledge of a protein's structure is important for understanding its function. With the success of the Human Genome Project, a widening gap has appeared between the rapidly growing number of known protein sequences and the slowly accumulating number of known protein structures. The main experimental methods for high-resolution protein structure determination are X-ray crystallography, NMR, and electron microscopy. However, purely experimental approaches are time-consuming and expensive, so theoretical and computational methods for predicting protein structure have become increasingly important.

At present, directly predicting the three-dimensional (3D) structure of a protein from its amino acid sequence remains difficult, and many approaches have instead been developed to predict protein secondary structure. Secondary structure prediction is often regarded as the first step toward understanding and predicting tertiary structure, because secondary structure elements are the building blocks of the folding units. As an intermediate step, it therefore plays an important role in tertiary structure prediction.

In this dissertation, we introduce a novel sequence-based method, tetra-peptide-based increment of diversity with quadratic discriminant analysis (TPIDQD for short), and apply it to protein secondary structure prediction on two datasets of different sizes. We also investigate the connection between protein secondary structure and 3D structure for 325 proteins.

(1) The TPIDQD method consists of three steps. First, the frequencies of three kinds of tetra-peptide structural words occurring in a sequence fragment are used as the source of diversity. Second, the increment of diversity combined with quadratic discriminant analysis (IDQD for short) is used to predict the structure of the central residue of the fragment. Finally, the IDQD prediction is corrected by removing structure fluctuations and by adjusting structure boundaries using tetra-peptide boundary words.

(2) Applied to the central residues of 21-residue fragments in the CB513 dataset, TPIDQD attains a three-state overall per-residue accuracy (Q3) of 79.19% in a three-fold cross-validation test.

(3) An enlarged dataset is constructed, containing 1645 protein chains with resolution better than 3 Angstroms and sequence identity below 25%. On this dataset TPIDQD reaches a Q3 of 79.68% in a ten-fold cross-validation test for 21-residue fragments, higher than on CB513. The accuracy improves further when long-range sequence information (fragments longer than 21 residues) is taken into account. Moreover, Q3 reaches 79% on the independent test set as the number of structural words increases.

(4) We investigate the relation between protein secondary structure and 3D structure for 325 samples. After the dependence on chain length is removed, 300 of the correlation coefficients between the secondary-structure distance and the 3D-structure distance exceed the threshold at the significance levels α = 0.05 and α = 0.01.
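
The accuracy measure Q3 reported in (2) and (3) is the percentage of residues whose three-state label is predicted correctly. The Python sketch below computes it and pairs it with one possible way of removing structure fluctuations. The smoothing rule shown (a single residue whose two neighbours agree with each other but not with it takes the neighbours' state) is a hypothetical stand-in, since the abstract does not describe the actual correction rules; the H/E/C state letters and the example strings are likewise assumptions.

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose three-state label is predicted correctly."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def remove_isolated_states(predicted: str) -> str:
    """Hypothetical smoothing rule, applied in one left-to-right pass:
    a residue flanked by two identical states that differ from its own
    is reassigned to the flanking state."""
    out = list(predicted)
    for i in range(1, len(out) - 1):
        if out[i - 1] == out[i + 1] != out[i]:
            out[i] = out[i - 1]
    return "".join(out)


# Made-up example: H = helix, E = strand, C = coil.
observed = "HHHHHCCEEE"
predicted = "HHCHHCCEEE"
before = q3(observed, predicted)
after = q3(observed, remove_isolated_states(predicted))
```

Because the pass updates the list in place, a residue corrected at position i is already visible when position i + 1 is examined, which keeps one isolated error from triggering a second change next to it.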

  • 【Online publication contributor】 Inner Mongolia University
  • 【Online publication issue】 2009, No. 02