The Study of Characterization and Prediction of Binding Sites on Proteins Based on Machine Learning Methods

(Original Chinese title: 基于机器学习的蛋白质结合位点特征化和预测方法研究)

【Author】 Yi Xiong (熊毅)

【Supervisor】 Juan Liu (刘娟)

【Author Information】 Wuhan University, Computer Application Technology, 2011, PhD

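【Abstract (translated from the Chinese original)】 With the successful completion of genome sequencing projects for human and many other species, the ever-growing body of genome sequence data provides coding information for millions of proteins. As the embodiment of genetic information, proteins are the principal carriers and functional executors of life processes. In living cells, proteins carry out specific functions by interacting with other biomolecules, but the residues that directly take part in these interactions make up only a portion of a protein, and these binding sites are crucial to protein function. Analyzing and identifying the binding sites between proteins and other molecules is therefore the foundation for studying the mechanisms by which proteins realize their functions. Over the last decade, researchers have turned to computational methods for predicting functional residues on proteins, in particular machine learning-based methods that predict functional residues from protein sequence or structure information. This dissertation uses amino acid properties to examine the common and distinctive physicochemical characteristics of the sites at which proteins bind different types of molecules, and on that basis proposes a classification method for predicting protein binding sites for other types of molecules (such as heme binding sites). It then designs effective features and feature representations, drawn mainly from the three-dimensional structure and topological structure of proteins, to describe and predict DNA-binding residues. The main research content is summarized as follows:

1. Using physicochemical amino acid properties, we analyze the specific characteristics of the sites at which proteins bind different types of molecules (proteins, DNA/RNA and heme), and propose a classification method that predicts heme binding sites from sequence information. Starting from the simplest and most intuitive, yet highly interpretable, physicochemical features, we analyze the physicochemical features relevant to the binding sites for each type of molecule; the results show that binding sites for different types of bound molecules have different properties. We then propose a simple, intuitive feature selection method and an integrative sequence profile encoding scheme, and implement a new method that predicts binding sites on heme proteins based on the integrative sequence profile. Both cross-validation on the training set and independent validation on the test set show that our method considerably improves prediction accuracy over results previously reported in the literature.

2. Feature design and analysis in prediction models for DNA-binding residues. We first construct a benchmark dataset that integrates protein structures before and after DNA binding. We then introduce new structural features, including temperature factor, packing density and topological features, to describe the binding residues on DNA-bound proteins and on the corresponding unbound proteins. The analysis of binding residues with these new features can provide useful information to molecular biologists.

3. A prediction model for DNA-binding residues based on a dimensionality reduction strategy. Building on our earlier feature design and analysis for DNA-binding residues, we further propose a weighting factor that quantifies the distance-dependent contribution of surrounding amino acids to the central amino acid, and then reduce dimensionality by extracting weighted average features over the surface patch. On this basis we implement a new method that predicts DNA-binding residues from the reduced set of weighted average features. Experimental results show that the new method proposed in this chapter achieves higher efficiency and prediction accuracy than existing machine learning methods in the literature, and that its weighted average dimensionality reduction strategy can be extended to the prediction of other types of binding residues.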

【Abstract】 With the completion of genome sequencing projects for human and other species, the increasing availability of genome sequencing data provides encoding information for hundreds of thousands of proteins. As the products of genetic information, proteins are the carriers of the most important biological activities and the executors of cellular functions. In biological cells, proteins perform specific functions when they interact with other molecules. However, only some of the residues on a protein participate directly in the interaction with other molecules. These interacting residues play crucial roles in various biological functions. Therefore, the characterization and identification of functional residues or binding sites provides important clues for exploring the function of proteins. In the last decade, researchers have focused on the development of computational methods to predict functional residues on proteins. In particular, machine learning-based methods have been applied to the prediction of binding residues from sequence-derived or structure-derived features. In this dissertation, we first exploit amino acid indices to analyze the physicochemical attributes specific to the different types of molecules (such as protein, DNA/RNA and heme) that bind to proteins, and we propose a new classification method to predict heme binding residues from the sequences of heme binding proteins. More importantly, we explore and design effective structural and topological features to characterize and predict DNA-binding residues. The research topics are outlined as follows:

1. We exploit amino acid indices to analyze the physicochemical attributes specific to the different types of molecules (protein, DNA/RNA and heme) that bind to proteins, and propose a new sequence-based method to predict heme binding residues. Our results show that the different types of binding residues have their own relevant attributes. We first propose an intuitive feature selection scheme and a novel integrative sequence profile, which is generated by coupling the PSSM with the selected physicochemical properties. Evaluation experiments using 5-fold cross-validation on the training set and an independent test set demonstrate that our proposed approach outperforms the conventional methods based on PSSM profiles for the prediction of heme binding residues.

2. The design and analysis of features for DNA-binding residues in the prediction models. In this section, we first build the benchmark datasets, which consist of DNA-binding protein structures in both their holo and apo forms. Then, we introduce novel features such as temperature factor, packing density and betweenness centrality to describe DNA-binding residues on bound and unbound structures. The statistical results derived from the new features can provide useful information and knowledge to molecular biologists.

3. We propose a new method that uses a strategy based on dimensionality reduction to predict DNA-binding residues. In the previous section, the methods for predicting DNA-binding residues included data for neighboring residues by concatenating a number of properties, resulting in high-dimensional feature vectors. To overcome this limitation, we first introduce a novel weighting factor to quantify the distance-dependent contribution of each neighboring residue in determining the location of a binding residue. Then, a weighted average scheme (dimensionality reduction) is proposed to represent the surface patch of the residue under consideration. Based on these strategies, we exploit a reduced set of weighted average features to improve the prediction of DNA-binding residues from structures. Experimental results indicate that our approach can predict DNA-binding residues with high accuracy and high efficiency using a reduced set of weighted average features, and that it compares favorably to the two previous methods. We believe that the weighted average scheme can potentially be extended to predict other functional sites, such as protein-protein and protein-RNA interaction residues.
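
The weighted average scheme in item 3 can be illustrated with a minimal sketch. The abstract does not give the form of the weighting factor, so the Gaussian decay below, the patch radius, the `sigma` parameter and the function name `weighted_average_features` are all assumptions made for illustration, not the dissertation's actual definitions. The sketch only shows the general idea: instead of concatenating the properties of every neighbor (which yields one block of values per neighbor), each property is averaged over the patch with distance-dependent weights, so the feature vector keeps the length of a single residue's property list.

```python
import math


def weighted_average_features(center_idx, coords, features, radius=10.0, sigma=5.0):
    """Collapse a surface patch into one weighted-average feature vector.

    coords   : list of (x, y, z) residue positions, e.g. C-alpha atoms (assumed input)
    features : list of per-residue property lists, all of the same length
    radius   : patch cutoff in angstroms (hypothetical value)
    sigma    : decay scale of the assumed Gaussian weight (hypothetical value)
    """
    center = coords[center_idx]
    n_props = len(features[center_idx])
    totals = [0.0] * n_props
    weight_sum = 0.0

    for xyz, props in zip(coords, features):
        d = math.dist(center, xyz)  # Euclidean distance, Python 3.8+
        if d > radius:
            continue  # residue lies outside the patch
        # Assumed weighting factor: closer residues contribute more.
        w = math.exp(-(d * d) / (2.0 * sigma * sigma))
        weight_sum += w
        for k, value in enumerate(props):
            totals[k] += w * value

    # The central residue always has weight 1, so weight_sum is never zero.
    return [t / weight_sum for t in totals]


if __name__ == "__main__":
    # Toy patch: three residues with two properties each.
    coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
    features = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
    print(weighted_average_features(0, coords, features))
```

With a patch of m residues and p properties per residue, concatenation would produce m × p values, whereas this representation always produces p values regardless of patch size. The resulting vectors could then be passed to any standard classifier.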

  • 【Online Publication Contributor】 Wuhan University
  • 【Online Publication Year and Issue】 2012, Issue 07