Research on Protein Structure Prediction Integrating Computational Intelligence (融合计算智能的蛋白质结构预测研究)

【Author】 Liu Jun (刘君)

【Advisor】 Xiong Zhongyang (熊忠阳)

【Author information】 Chongqing University, Computer System Architecture, 2011, Ph.D.

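【Abstract (translated from the Chinese)】 Proteome research is one of the most important topics in the life sciences of the post-genomic era, and the prediction of protein spatial structure occupies an extremely important place in the proteome project as a whole. Protein structure prediction covers sequence preprocessing, secondary structure prediction, supersecondary structure prediction, contact map prediction, tertiary structure prediction, and related tasks. This thesis studies three of these in depth: sequence preprocessing, secondary structure prediction, and contact map prediction.

Protein sequences are translated from DNA sequences, so the quality of a DNA sequence determines how accurately the protein structure can be predicted. Existing DNA sequence preprocessing tools filter and remove junk base information inefficiently, and their error probability rises markedly as DNA sequences grow longer. This thesis therefore studies DNA sequence preprocessing.

BP neural networks are widely used in protein secondary structure prediction, but the BP algorithm has clear shortcomings, such as slow training and a tendency to become trapped in local extrema, which significantly affect prediction accuracy. The neural network algorithms applied to protein structure prediction therefore need to be improved. Current secondary structure prediction work is also deficient in feature representation: it considers only the basic amino acid composition, expresses feature information incompletely, and ignores both amino acid hydrophobicity and long-range interactions between amino acids. A secondary structure classification method built on a more complete feature representation is therefore needed.

The three-dimensional structure of a protein is closely tied to its function. At present, predicting three-dimensional structure directly from secondary structure is very difficult, and the contact map is an important bridge between tertiary and secondary structure, so contact map prediction is of considerable research significance.

The main results and contributions of the thesis are as follows.

① A new DNA sequence preprocessing method that incorporates intelligent detection is proposed. It does not require vector sequences, splice sites, cloning adapter fragments, or similar information in advance; it discovers and locates junk information automatically through statistical analysis, random search, and graph operations. The method can be called as a component tool by DNA sequence data processing pipelines.

② An improved dynamic tunneling neural network algorithm for protein secondary structure prediction is proposed. Neural networks tend to become trapped in local minima. A dynamic tunneling neural network "drills a tunnel" so that the objective function escapes a local minimum and reaches a lower feasible region, which keeps the network from getting stuck. The traditional dynamic tunneling technique uses a single, arbitrarily chosen tunnel direction and is therefore unstable. To improve search efficiency, the improved algorithm increases the number of tunnels searched, introduces an angle elasticity coefficient to control tunnel direction, and takes the mutual influence between tunnels into account. In secondary structure prediction experiments, it outperforms both the plain neural network algorithm and the traditional dynamic tunneling neural network algorithm.

③ Comparative experiments analyze how amino acid hydrophobicity and long-range interactions between amino acids affect secondary structure prediction. Current machine learning methods for secondary structure prediction ignore both factors, so their accuracy is limited. Replacing each amino acid in a protein with its hydrophobic energy value yields a sequence of hydrophobic energy values. The experiments show that a BP network trained on long hydrophobic energy value sequences predicts the E structure (β-sheet), which is dominated by long-range interactions, well.

④ A Co-training algorithm based on a more complete protein feature representation is proposed. The comparative experiments show that long-range amino acid interactions matter for predicting the E structure (β-sheet). The algorithm therefore uses two independent, redundant views: Profile encoding features and hydrophobic energy value features. Its main steps are to train an SVM classifier in the Profile feature space, train a BP neural network classifier in the hydrophobicity feature space, and let the two predict amino acid secondary structure cooperatively. For samples on which the SVM and BP classifiers disagree, an arbitration step based on active selection gives the two classifiers different priorities. Experiments show that the Co-training method is accurate and improves prediction for both the E structure (β-sheet), dominated by long-range interactions, and the H structure (α-helix), dominated by short-range interactions.

⑤ Markov logic networks are applied to protein contact map prediction for the first time. A Markov logic network is a statistical relational learning model that combines Markov networks with first-order logic; it can compute a probability distribution over possible worlds, which then supports inference. The thesis exploits this strength to formalize the contact map prediction problem. It uses a discriminative training algorithm for learning and the MC-SAT algorithm for inference, and explains in detail how a small number of predicate formulas can describe the essential features of different aspects of contact map prediction and how these aspects, expressed in Markov logic, are combined into various models. The experimental results show that the Markov logic network approach predicts contact maps better than neural network based methods, which offers an effective route for applying Markov logic networks to practical prediction problems.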
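
The hydrophobic value encoding described in contribution ③ can be illustrated with a minimal sketch. The abstract does not say which hydrophobic energy scale the thesis uses, so the Kyte-Doolittle hydropathy index stands in here as an assumption, and the function names and the zero-padded window are illustrative only.

```python
# Kyte-Doolittle hydropathy index. This scale is an assumed stand-in: the
# abstract does not name the hydrophobic energy values used in the thesis.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_sequence(protein):
    """Replace each residue letter by its hydropathy value.

    Raises KeyError for letters outside the 20 standard amino acids.
    """
    return [KYTE_DOOLITTLE[residue] for residue in protein.upper()]


def sliding_windows(values, width, pad=0.0):
    """Return one fixed-width window centred on each position.

    The ends are padded with `pad` so every residue gets a full window;
    a wider window lets a classifier see more distant residues.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    padded = [pad] * half + list(values) + [pad] * half
    return [padded[i:i + width] for i in range(len(values))]
```

Each window could then serve as one input vector for a classifier such as the BP network mentioned in the abstract; the window width that counts as "long" in the thesis is not stated here.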

【Abstract】 With the arrival of the post-genomic era, proteomics has become one of the most important research areas in the life sciences, and protein structure prediction plays a significant role in proteomics as a whole. Research on protein structure prediction covers sequence preprocessing, secondary structure prediction, supersecondary structure prediction, contact map prediction, and 3D structure prediction, among other topics. This thesis studies sequence preprocessing, secondary structure prediction, and contact map prediction in depth.

Protein sequences are translated from DNA sequences, so the quality of the DNA sequences is an important factor in the accuracy of protein structure prediction. Existing DNA sequence preprocessing tools are still inefficient at filtering and cleaning noise segments, and the probability of error increases significantly as the DNA sequence gets longer. This thesis therefore studies DNA sequence preprocessing.

BP neural networks have been widely used in protein secondary structure prediction, but they have defects such as slow convergence and a tendency to become trapped in local optima. These defects affect prediction accuracy and need to be addressed. Existing methods for secondary structure prediction are also limited in feature representation: they consider only the basic amino acid composition and therefore cannot represent the necessary information completely. The hydrophobicity of amino acids and the interactions between amino acids that are far apart in the sequence are ignored. An improved classification method for secondary structure prediction, based on a more complete feature representation, is therefore needed.

The 3D structure of a protein is tightly associated with its function. At present it is very difficult to predict 3D structure directly from secondary structure. Protein contact maps can serve as a link between 3D structure and secondary structure, so there is a need to predict them.

The main contributions of this thesis are summarized as follows.

First, a novel DNA preprocessing method that incorporates intelligent detection is proposed. It finds and locates contaminants automatically using statistical methods, random search, and graph-theoretic operations, without extra background information such as vector sequences, splice sites, or clone adapters. The method can be used in a DNA data processing pipeline as an independent component tool.

Second, an improved dynamic tunneling neural network algorithm for protein secondary structure prediction is proposed. Neural networks easily become trapped in local minima. The dynamic tunneling technique helps a network escape them by "tunneling" into lower valleys of the objective function. However, the traditional technique searches in a single, random direction and is therefore unstable. To improve search efficiency, the improved algorithm increases the number of tunneling directions and uses an angle spring coefficient to control the interaction between the trajectories of the tunneling system, which improves stability. Experimental results show that the improved algorithm outperforms both the traditional neural network and the traditional dynamic tunneling neural network in protein secondary structure prediction.

Third, comparative experiments test the influence of amino acid hydrophobicity and of long-range interactions between amino acids on secondary structure prediction. Existing machine learning based methods suffer from low prediction accuracy because they ignore both factors. A sequence of hydrophobic values can be built by replacing each amino acid with its hydrophobic energy value. The experiments show that a BP neural network trained on long sequences of hydrophobic energy values works well for predicting the E structure (β-strand), which is controlled mainly by long-range interactions between amino acids.

Fourth, a Co-training algorithm based on different protein features is proposed. The comparative experiments show that long-range interactions between amino acids play a significant role in predicting the E structure (β-strand). The algorithm is therefore built on two sufficient and redundant views: the profile space and the hydrophobic energy value space. It uses two classifiers, an SVM trained in the profile space and a BP neural network trained in the hydrophobic energy value space, and each predicts the secondary structure of an amino acid independently. If the two classifiers disagree on an amino acid, an arbitration rule makes the final decision; the rule follows an active selection strategy that assigns the two classifiers different priority levels. The experimental results show that the proposed algorithm achieves higher prediction accuracy than existing algorithms for both the E structure (β-strand), which is controlled mainly by long-range interactions, and the H structure (α-helix), which is controlled mainly by short-range interactions.

Fifth, Markov Logic Networks are applied to protein contact map prediction for the first time. Markov Logic Networks (MLNs) are statistical relational learning models that combine Markov networks with first-order logic. They can compute a probability distribution over possible worlds, which then supports inference. This thesis introduces the theory, learning methods, and inference algorithms of Markov Logic Networks and applies them to protein contact map prediction. It adopts a discriminative learning algorithm for weight learning and the MC-SAT algorithm for inference. It also shows how to capture the essential features of different aspects of contact map prediction with a small number of predicate rules, and how to combine these rules into different models. Experimental results show that the method based on Markov Logic Networks performs better than the method based on conventional neural networks in protein contact map prediction. This research thus provides a new solution for practical prediction problems of this kind.
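
The arbitration between the two classifiers in the fourth contribution can be pictured with a small sketch. It covers only the step that resolves disagreements, not training. The abstract does not give the actual rule, so the priority scheme below (the BP prediction wins only when it proposes a class in `bp_priority_classes`, by default E) is a hypothetical choice, loosely motivated by the reported strength of the hydrophobicity view on E structures; the function and parameter names are likewise illustrative.

```python
def arbitrate(svm_labels, bp_labels, bp_priority_classes=frozenset("E")):
    """Merge per-residue predictions from two classifiers.

    Where the two predictions agree, the shared label is kept. Where they
    disagree, the BP label wins if it belongs to bp_priority_classes and
    the SVM label wins otherwise. This priority rule is an assumption,
    not the rule used in the thesis.
    """
    if len(svm_labels) != len(bp_labels):
        raise ValueError("prediction sequences must have equal length")
    merged = []
    for svm_label, bp_label in zip(svm_labels, bp_labels):
        if svm_label == bp_label:
            merged.append(svm_label)
        elif bp_label in bp_priority_classes:
            merged.append(bp_label)
        else:
            merged.append(svm_label)
    return "".join(merged)
```

With three-state labels H, E, and C, `arbitrate("HHCE", "HECC")` keeps the agreed positions, takes the BP label E at the second residue, and falls back to the SVM label at the last one.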

  • 【Online publication contributor】 Chongqing University
  • 【Online publication issue】 2012, Issue 07