蛋白质相互作用及其结合面热点残基的预测方法研究

Computational Prediction of Protein-protein Interactions and Hot Spot Residues in Protein Interfaces

【Author】 夏俊峰 (Xia Junfeng)

【Advisor】 黄德双 (Huang Deshuang)

【Author information】 University of Science and Technology of China, Bioinformatics, 2010, PhD

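【Summary】 With the successful completion of the sequencing projects for the human genome and the genomes of other species, biological research has moved from the genomic era into the post-genomic era. Proteomics, which developed around the study of protein-protein interactions and is one of the major research fields of the post-genomic era, has become a hot topic and frontier of current life science research. Studying all protein interactions within a cell (the interactome) and analyzing the composition and mode of action of the various protein complexes are essential for understanding the complex operating mechanisms of living organisms. Over the past few years, researchers have proposed many computational bioinformatics methods for studying protein interactions. Among these, sequence-based prediction methods have attracted great attention. They require no prior knowledge and can be applied widely in protein interaction research. Moreover, protein sequences are determined far faster than protein structures can be resolved experimentally. Predicting interactions from sequence information is therefore a very attractive computational approach. This thesis starts from protein sequences and uses machine learning methods such as support vector machines (SVMs) and ensemble learning to predict protein interactions. In addition, we study hot spot residues, which play a key role in preserving protein function and the structural stability of protein complexes. The main work is summarized as follows:

1. We propose a protein interaction prediction method based on an amino acid sequence autocorrelation descriptor and rotation forest. The autocorrelation descriptor characterizes the interaction between two residues separated by a given distance along the protein sequence, so this encoding takes the neighborhood of each amino acid into account and may reveal interaction-related patterns across the whole sequence. We first convert the symbolic amino acid sequence into numerical sequences of physicochemical property values, and then use the autocorrelation descriptor to convert these numerical sequences of unequal length into vectors of equal length. Finally, we apply rotation forest to predict protein interactions. Rotation forest is a recently designed ensemble learning algorithm that can simultaneously improve the accuracy and the diversity of the individual classifiers in an ensemble. Experimental results show that our method predicts protein interactions effectively and achieves good recognition performance on both the yeast and the Helicobacter pylori datasets.

2. We propose a protein interaction prediction method based on a segmented local descriptor of amino acid sequences and SVMs. An important feature of protein interactions is that they often occur in discontinuous regions of the sequence, where residues that are far apart in the sequence are brought close together in space by protein folding. The segmented local descriptor takes these relationships between sequence-distant residues into account. We first divide a protein sequence into ten local segments of varying length and composition, and then encode each local segment with the local descriptor. The method can therefore capture multiple overlapping binding patterns that are continuous or discontinuous in the sequence. Experiments with an SVM prediction model built on this encoding show that the method effectively improves protein interaction prediction.

3. We construct a meta-learning model to predict protein interactions. In addition to the two feature encoding methods proposed above, we selected four well-performing encoding methods from the literature. Combining these feature encodings with SVMs, we built six sequence-based individual classifiers for interaction prediction. On top of these individual classifiers, we constructed a meta-learning ensemble system for protein interaction prediction. The results show that the meta-learning model substantially improves prediction performance. The model also performs well on cross-species datasets.

4. We propose a method for predicting hot spot residues in interaction interfaces based on amino acid solvent accessibility and protrusion index. When computational methods are applied to interface hot spot residues, choosing effective biological features is the key problem to be solved. We first extracted, from protein sequence and structure, a range of biological features potentially related to hot spot residues. Through feature selection, we then built nine SVM classification models, each based on a single feature. Finally, to further improve the accuracy of hot spot prediction, we combined the outputs of these nine models by simple majority voting. Our study shows that the solvent accessibility and protrusion index of amino acid residues are the main discriminative features in hot spot prediction. This is the first use of the residue protrusion index for predicting hot spot residues. Experimental results confirm that our method classifies hot spot residues more effectively, with a significant improvement in prediction accuracy.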
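The majority-vote step in item 4 can be illustrated with a minimal sketch. It assumes that each of the nine single-feature SVM models has already produced a binary call per residue (1 = hot spot, 0 = non-hot spot). The function names and the data layout are hypothetical and are not taken from the thesis.

```python
def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0.

    With an odd number of voters (nine in the thesis) a tie cannot occur.
    """
    return 1 if 2 * sum(votes) > len(votes) else 0


def ensemble_predict(votes_per_model):
    """Combine per-model predictions residue by residue.

    votes_per_model[m][r] is the 0/1 call of model m for residue r
    (a hypothetical layout chosen for this sketch).
    """
    return [majority_vote(column) for column in zip(*votes_per_model)]
```

For example, `majority_vote([1, 1, 1, 1, 1, 0, 0, 0, 0])` yields 1 because five of the nine models vote for a hot spot.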

【Abstract】 With the completion of the sequencing of the human genome and the genomes of other species, biological research has gradually moved from the genomics era to the post-genomics era. As one of the most important fields of the post-genomics era, proteomics, which developed around the study of all possible protein-protein interactions (PPIs) in a cell, has become a hot topic and frontier of life science. Studying PPIs can help us understand the essential mechanisms of life processes.

So far, a number of computational methods have been explored for the large-scale prediction of PPIs. Among these, protein sequence-based prediction methods have attracted much attention. The accuracy and reliability of these methods do not depend on prior information about the protein pairs. Because three-dimensional protein structures are of limited availability while the number of protein sequences is increasing rapidly, approaches that use amino acid sequence information alone to guide the discovery of PPIs are of particular interest. This thesis therefore applies machine learning techniques such as the support vector machine (SVM) and multiple classifier systems to predict PPIs from sequences. In addition, we introduce an ensemble learning method with SVMs to predict hot spot residues, which are observed to be crucial for preserving protein function and maintaining the stability of protein association. The main contributions of this thesis are as follows:

1. A sequence-based approach was proposed to predict PPIs by combining a new feature representation, the autocorrelation descriptor, with rotation forest. The autocorrelation descriptor accounts for the interactions between amino acid residues a certain distance apart in the sequence, so it takes the local environment of each amino acid into account and makes it possible to discover patterns that run through entire sequences. The amino acid sequences were first translated into numerical values representing six physicochemical properties, and these numerical sequences were then converted into a series of fixed-length vectors by the autocorrelation descriptor. Finally, the rotation forest was constructed using these vectors as input. Rotation forest is a recently proposed robust ensemble method that can simultaneously enhance the accuracy and the diversity of the individual classifiers in the ensemble. Experimental results on Saccharomyces cerevisiae and Helicobacter pylori datasets show that the proposed approach outperforms previously published methods, which demonstrates its effectiveness and efficiency.

2. A method based on a novel local protein sequence descriptor and SVM was presented to infer PPIs. One particular feature of protein interaction is that interactions usually occur in discontinuous regions of the protein sequence, where distant residues are brought into spatial proximity by protein folding. In the current study, a novel local protein sequence descriptor was used to capture information on interactions between amino acids that are distant in the sequence. A protein sequence was characterized by ten local descriptors of varying length and composition, so this method is capable of capturing multiple overlapping continuous and discontinuous binding patterns within a protein sequence. As expected, the experimental results show that our SVM-based predictive model with this encoding scheme is an important complementary method for PPI prediction.

3. A public meta predictor was constructed to infer PPIs using only protein sequence information. Besides the two feature representation methods above (i.e., the autocorrelation descriptor and the local descriptor), four additional methods were selected according to their prediction accuracy in previous studies. We then built six sequence-based individual classifiers by combining the different feature representation methods with SVMs. Finally, we adopted another SVM as the meta predictor to integrate the prediction decision values of these component predictors. The results demonstrated that our meta predictor is promising. In addition, we used the final prediction model trained on the S. cerevisiae PPI dataset to predict interactions in other species. The results reveal that the meta model is also capable of cross-species prediction.

4. A feature-based method that combines protrusion index with solvent accessibility was presented for accurate prediction of hot spots in protein interfaces. The biological properties responsible for hot spots are not yet fully understood, and consequently the features previously identified as correlated with hot spots are still insufficient. We first extracted a wide variety of features from a combination of protein sequence and structure information. We then performed feature selection to remove noisy and irrelevant features, which improved the performance of the classifier. After extensive feature selection, nine individual-feature-based predictors were developed to identify hot spots using SVMs. Finally, we employed an ensemble classifier approach, which further improved the prediction accuracy for hot spots. To demonstrate its effectiveness, the proposed method was applied to two benchmark datasets. Empirical studies show that our method yields significantly better prediction accuracy than previously published methods.
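The encoding step of item 1, which turns sequences of unequal length into fixed-length vectors, can be illustrated with a minimal sketch. The abstract does not give the exact autocorrelation formula, the six physicochemical properties, or the lag range, so everything below is an assumption: an auto-covariance style formula, a single property scale (Kyte-Doolittle hydropathy, chosen only for illustration), a small maximum lag, and hypothetical function names.

```python
# Illustrative property scale (Kyte-Doolittle hydropathy). This is an
# assumption: the thesis uses six properties that the abstract does not name.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocorrelation_descriptor(sequence, scale=HYDROPATHY, max_lag=3):
    """Map a sequence of any length to a vector with max_lag entries.

    Entry k is the mean product of mean-centered property values of
    residues that are k positions apart (an assumed formula).
    """
    values = [scale[aa] for aa in sequence]
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def encode_pair(seq_a, seq_b, max_lag=3):
    """Concatenate both proteins' descriptors into one feature vector."""
    return (autocorrelation_descriptor(seq_a, max_lag=max_lag)
            + autocorrelation_descriptor(seq_b, max_lag=max_lag))
```

Under these assumptions, a protein pair is represented by a vector of length `2 * max_lag` regardless of the lengths of the two sequences, which is what allows a standard classifier to take it as input. With six properties, the per-property vectors would presumably be concatenated in the same way.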
