蛋白质的β-发夹、β(γ)-转角及四类简单超二级结构预测

Prediction of the β-Hairpins, β(γ)-Turns and Four Kinds Simple Super-secondary Structures in Proteins

【Author】 Hu Xiuzhen (胡秀珍)

【Advisor】 Li Qianzhong (李前忠)

【Author information】 Inner Mongolia University, Theoretical Physics, 2007, PhD

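【Abstract, translated from the Chinese】 Because a protein's function is closely tied to its structure, studying protein structure is an important way to obtain functional information. With the successful execution of the Human Genome Project, protein sequence data have been accumulating far faster than protein structure data. Determining protein structure experimentally, however, is costly and time-consuming, and experiments still run into technical difficulties that cannot currently be solved. There is therefore strong interest in predicting protein structure directly from sequence information by theoretical computation, which is an important topic in bioinformatics. At present, predicting a protein's tertiary structure directly from sequence remains very difficult. Local structures carry relatively strong sequence signals, occur abundantly and frequently in tertiary structures, and play an important role in protein folding, recognition and stability. Predicting local structure can therefore simplify the structure prediction problem and is an important intermediate step toward tertiary structure prediction. This dissertation mainly studies the prediction of super-secondary structures among protein local structures, with a focus on β-hairpin motifs; it also studies the prediction of β-turns and γ-turns, which are partially regular secondary structures.
1. A new prediction algorithm is proposed: a support vector machine algorithm based on the increment of diversity. It is applied for the first time to predicting β-hairpin motifs in the super-secondary structure database ArchDB40, with good results.
2. Sequence information is represented by a vector composed of increments of diversity and sequence scoring values. These values are fed as a vector into a support vector machine, which searches for the optimal hyperplane in the vector space, giving a new combined-vector prediction algorithm. The algorithm is applied for the first time to β-hairpin motif prediction. Results on the β-hairpin dataset from ArchDB40 and on the existing β-hairpin dataset from the literature (Kumar and Bhasin, Nucleic Acids Research, 2005, 33:154-159) show that our algorithm achieves a higher prediction success rate than previous methods. Compared with the published results on the existing dataset, prediction accuracy on the independent test set improves by 4% and β-hairpin sensitivity improves by 6%. The algorithm is also applied for the first time to classifying the four kinds of simple super-secondary structures in ArchDB40, with good classification results both on the training set under 5-fold cross-validation and on the independent test set.
3. Building on the increment of diversity and sequence scoring values, predicted secondary structure information is added to the combined vector, and all three are fed into the support vector machine. β-turns and γ-turns are then predicted on two widely used datasets containing 426 and 320 protein sequences, respectively. For β-turns, 7-fold cross-validation gives a prediction accuracy of 79.8% and a correlation coefficient of 0.47; for γ-turns, 5-fold cross-validation gives a correlation coefficient of 0.18. These are the best prediction results reported to date.
4. A new database of 2208 non-redundant protein chains is built, with structure resolution better than 2.5 Å and sequence similarity below 40%. It yields 6799 α-α motifs, 6711 α-β motifs, 6072 β-α motifs and 8163 β-β motifs. The minimum increment of diversity algorithm is applied for the first time to predicting the four kinds of simple super-secondary structures. With a fixed sequence pattern length of 8 amino acid residues, the "822-type" pattern gives an average prediction accuracy of 78% in 3-fold cross-validation and 76.8% in the jack-knife test. With a fixed length of 10 residues, the "1041-type" pattern gives an average prediction accuracy of 83% in 3-fold cross-validation and 79.8% in the jack-knife test.
5. In the classification of simple super-secondary structures and in the prediction of β-hairpins, β-turns and γ-turns, dipeptide composition parameters and hydropathy feature parameters are introduced, which improves the prediction results.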
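The abstract names the increment of diversity in items 1, 2 and 4 but does not define it. The following is a minimal sketch, assuming the definition commonly used in the increment-of-diversity literature: for a count vector X with total N, the diversity is D(X) = N log N − Σ n_i log n_i, and the increment of diversity between X and Y is ID(X, Y) = D(X + Y) − D(X) − D(Y). Using dipeptide counts as the feature follows item 5; the function names, the toy sequences and the two class labels are hypothetical and not taken from the dissertation.

```python
import math
from collections import Counter


def diversity(counts):
    """D(X) = N*log(N) - sum(n_i*log(n_i)), treating 0*log(0) as 0."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y).

    The value is 0 when X and Y have identical proportions and grows
    as the two count profiles become more dissimilar.
    """
    merged = Counter(x)
    merged.update(y)  # element-wise sum of the two count vectors
    return diversity(merged) - diversity(x) - diversity(y)


def dipeptide_counts(seq):
    """Count overlapping dipeptides in an amino acid sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


def predict_min_id(seq, class_profiles):
    """Assign seq to the class whose profile gives the smallest ID."""
    query = dipeptide_counts(seq)
    return min(
        class_profiles,
        key=lambda label: increment_of_diversity(query, class_profiles[label]),
    )


# Toy class profiles built from made-up sequences (illustration only).
profiles = {
    "hairpin": dipeptide_counts("VKVTVNGKTYEV"),
    "other": dipeptide_counts("AEELLKKAEELL"),
}
predicted = predict_min_id("VTVNGK", profiles)
```

In the combined-vector approach of items 2 and 3, the ID values against each class profile would serve as input features to a support vector machine instead of being compared directly; that step is omitted here.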

【Abstract】 Knowledge of a protein's structure is important for understanding its function. With the success of the Human Genome Project, a widening gap has appeared between the rapidly increasing number of known protein sequences and the slow accumulation of known protein structures. Determining protein structure by purely experimental approaches is time-consuming and expensive, so theoretical and computational methods for predicting protein structure have become increasingly important. At present, directly predicting a protein's three-dimensional (3D) structure from its sequence is a difficult task. Local structural motifs, however, carry strong sequence signals, are common in 3D structures, and govern the stability and folding of proteins. Predicting local structure may therefore help to simplify the structure prediction problem and is a key step toward predicting 3D structure. In this dissertation, we investigate the prediction of super-secondary structures in proteins, especially β-hairpin motifs. In addition, the β-turns and γ-turns of protein secondary structure are studied.
1. Based on the least increment of diversity algorithm, a new algorithm that combines the increment of diversity with a support vector machine (ID_SVM) is proposed to predict the β-hairpins in the ArchDB40 dataset, and better results are obtained.
2. A composite vector of increment of diversity and scoring values is used to express the sequence information. These values are input to a support vector machine (SVM), which finds the optimal hyperplane in the vector space to separate β-hairpins from non-β-hairpins. A new algorithm combining the increment of diversity and scoring values with a support vector machine (ID_PCSF_SVM) is proposed for predicting β-hairpin motifs in the ArchDB40 dataset and the EVA dataset (Kumar and Bhasin, Nucleic Acids Research, 2005, 33:154-159, http://cubic.bioc.columbia.edu/eva/index.html). It achieves higher prediction success rates than previous algorithms: the overall prediction accuracy improves by 4% and the sensitivity for β-hairpins increases by 6%. We also apply the method to predicting the super-secondary structures of the ArchDB40 dataset, and better results are obtained both for the training set under 5-fold cross-validation and for the independent test set.
3. The increment of diversity, scoring values and predicted secondary structure information are used together as input parameters of the SVM. A new algorithm is proposed for predicting β-turns in the 426-protein dataset and γ-turns in the 320-protein dataset. For β-turns, the overall prediction accuracy and Matthews correlation coefficient (MCC) in 7-fold cross-validation are 79.8% and 0.47, respectively. For γ-turns, the MCC in 5-fold cross-validation is 0.18.
4. A database is constructed that contains 2208 protein chains with resolution better than 2.5 Å and sequence identity below 40%. The chains contain 6799 α-α, 6711 α-β, 6072 β-α and 8163 β-β motifs. The four types of super-secondary structure are predicted with the increment of diversity algorithm. For the "822-type" fixed-length pattern of 8 amino acids, the average prediction accuracy is 78% in 3-fold cross-validation and 76.7% in the jack-knife test. For the "1041-type" fixed-length pattern of 10 amino acids, the prediction accuracies are 83% and 79.8%, respectively.
5. Using information on dipeptide composition and the amino acid hydropathy distribution improves the prediction results for super-secondary structures, β-hairpins, β-turns and γ-turns.
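The overall accuracy, sensitivity and Matthews correlation coefficient quoted in the abstract can all be derived from a binary confusion matrix. The sketch below uses the standard textbook formulas; the function names and the toy labels are illustrative assumptions and are not taken from the dissertation, whose exact evaluation protocol is not given in this record.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp, fn):
    """Fraction of true positives (e.g. real turns) that are recovered."""
    return tp / (tp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels (illustration only): 1 = turn residue, 0 = non-turn residue.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

Because turn residues are a minority class, accuracy alone can look high for a predictor that rarely predicts turns; the MCC accounts for all four cells of the confusion matrix, which is presumably why the abstract reports it alongside accuracy.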

  • 【Online publication contributor】 Inner Mongolia University
  • 【Online publication year and issue】 2009, Issue 03