
Analysis and Prediction of Interactions between Residues in Proteins (蛋白质残基间的相互作用分析与预测)

【Author】 Chen Peng (陈鹏)

【Advisor】 Huang De-Shuang (黄德双)

【Author information】 University of Science and Technology of China, Pattern Recognition and Intelligent Systems, 2007, PhD

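【Abstract (translated from the Chinese original)】 In recent years, with the rapid development of bioinformatics, large volumes of biological data need to be corrected, managed, interpreted, and fully exploited, and machine learning is well suited to such large, noisy data. Many machine learning algorithms have already been applied successfully to process biological data and to mine unknown biological knowledge. This thesis focuses on the support vector machine (SVM) as a machine learning tool for analyzing protein structure. It uses SVMs and a genetic algorithm (GA) to predict residue temperature factors (B-factors), and from these it analyzes and predicts long-range interactions between residues. The main contributions are as follows:

1. A method is proposed that uses a multi-class SVM to analyze and predict residue B-factors. In general, the B-factor of a residue reflects its thermal instability or degree of free motion, and a higher B-factor corresponds to a larger exposed surface area. Predicting residue B-factors therefore helps in understanding and predicting protein structure. The thesis describes the physicochemical features of amino acids that are used as input to the multi-class SVM, such as the protein sequence profile, the residue evolutionary rate, and the residue hydrophobicity value.

2. An SVM method based on predicted residue B-factors and hydrophobicity profile features is proposed for analyzing and predicting inter-residue contact cluster centers. In a protein contact map, long-range interaction points between residues tend to gather into clusters. Analysis shows that most of these clusters correspond to regions of low residue B-factor or strong hydrophobicity. Selective sample extraction based on this property reduces the imbalance between positive and negative samples and thus yields higher prediction performance. Finally, an SVM is used to predict the contact cluster centers, from which the interacting sites between residues are obtained.

3. A genetic algorithm based on protein sequence profile centers is constructed to analyze the sequence profile centers of residue pairs and thereby predict long-range interaction sites. Analysis with a genetic-algorithm-based multi-classifier (Genetic algorithm based Multi-Classification, GaMC) system finds that most long-range interaction sites lie around the sequence profile centers. This classifier can separate contacting from non-contacting residue pairs and can therefore predict whether a given residue pair has a long-range interaction.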
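As a rough illustration of how continuous B-factors might be turned into labels for a multi-class classifier, the sketch below z-score normalizes the B-factors of one chain and bins them into three classes. The abstract does not state the normalization, the number of classes, or the thresholds, so `normalize_b_factors`, `to_classes`, the cut points (-0.5, 0.5), and the sample values are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def normalize_b_factors(b_factors):
    """Z-score normalize the B-factors of a single chain.

    Per-chain normalization is an assumption here; it makes values
    comparable across structures refined at different resolutions.
    """
    mu = mean(b_factors)
    sigma = pstdev(b_factors)
    if sigma == 0:
        # A chain with identical B-factors carries no flexibility signal.
        return [0.0 for _ in b_factors]
    return [(b - mu) / sigma for b in b_factors]


def to_classes(z_scores, thresholds=(-0.5, 0.5)):
    """Map each z-score to a class index by counting exceeded thresholds.

    With the hypothetical default thresholds:
    0 = rigid, 1 = intermediate, 2 = flexible.
    """
    return [sum(z > t for t in thresholds) for z in z_scores]


# Made-up B-factors for a six-residue toy chain (not real data).
chain_b = [12.0, 15.5, 30.2, 45.0, 18.3, 10.1]
labels = to_classes(normalize_b_factors(chain_b))
```

The resulting labels would serve as training targets, with per-residue features such as the sequence profile, evolutionary rate, and hydrophobicity value as inputs to whichever multi-class classifier is used.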

【Abstract】 In recent years, the rapid development of bioinformatics has produced ever more biological data that must be corrected, managed, interpreted, and fully exploited. Machine learning methods are well suited to such large, noisy data sets. Many machine learning algorithms have already been applied successfully to process biological data and to mine previously unknown biological knowledge. This thesis uses machine learning tools, chiefly the support vector machine (SVM), to analyze protein structure, and applies SVMs and a genetic algorithm (GA) to predict residue temperature factors (B-factors) and long-range contacts between residues. The main contributions of the thesis are as follows:

1. A multi-class SVM method is proposed to analyze and predict the B-factors of protein residues. In general, the temperature factor, or B-factor, of a residue is linearly related to the mean square displacement of its C_α atom and indicates atomic flexibility in the crystalline state. Previous work has shown that hydrophobic residues, which are usually buried, tend to be more rigid, whereas charged residues tend to be more flexible. Predicting B-factors may therefore help in understanding and predicting the three-dimensional structure of proteins. The method takes selected properties of amino acid residues, such as the sequence profile of the protein chain, the evolutionary rate of the residue, and the hydrophobicity value of the residue, as input to the multi-class SVM.

2. An approach is proposed to predict inter-residue contact cluster centers from predicted residue B-factors, residue hydrophobicity values, and an SVM. It is well known that inter-residue contacts gather into clusters in protein contact maps. Almost all of these contact clusters are observed to correspond to residue pairs at local B-factor minima or within regions of higher hydrophobicity. Selecting the predictor's input vectors on the basis of these characteristics reduces the imbalance between positive and negative samples, and thus yields higher prediction performance. An SVM is then used to predict the inter-residue contact cluster centers, from which the inter-residue interacting sites are obtained.

3. A genetic algorithm based on the sequence profile (SP) centers of residue pairs is constructed to predict the SP centers of residue pairs and the long-range interacting sites between residues. First, a genetic algorithm-based multiple classifier (GaMC) is constructed, and analysis with it shows that most long-range contacts cluster around their SP centers. Second, the GaMC predictor can separate contacting residue pairs from non-contacting ones. Finally, the GaMC predictor and the SP centers together are used to decide whether two residues are in long-range contact.
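To make the notions of a contact map, a long-range contact, and selective sample extraction concrete, here is a minimal sketch using only the Python standard library. The 8 Å distance cutoff, the minimum sequence separation of 24, the z-score threshold, and the function names `contact_map`, `long_range_contacts`, and `candidate_pairs` are illustrative assumptions; the abstract does not give the thesis's actual definitions. The B-factor filter is also a simplification: it keeps pairs whose residues both fall below a threshold rather than detecting local minima.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix: True where two C-alpha atoms
    lie within `cutoff` angstroms (the cutoff value is an assumption)."""
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


def long_range_contacts(cmap, min_separation=24):
    """List contacting pairs (i, j) at least `min_separation` residues
    apart in sequence (the separation value is an assumption)."""
    n = len(cmap)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if cmap[i][j]]


def candidate_pairs(z_scores, max_z=0.0, min_separation=24):
    """Keep only pairs whose residues both have a normalized B-factor
    at or below `max_z`. Discarding the other pairs removes many
    negatives, which is one simple way to reduce class imbalance."""
    n = len(z_scores)
    return [(i, j)
            for i in range(n)
            for j in range(i + min_separation, n)
            if z_scores[i] <= max_z and z_scores[j] <= max_z]


# Toy hairpin: 15 residues along a line, then 15 running back 5 angstroms away.
coords = [(3.8 * i, 0.0, 0.0) for i in range(15)]
coords += [(3.8 * (29 - i), 5.0, 0.0) for i in range(15, 30)]
cmap = contact_map(coords)
contacts = long_range_contacts(cmap)
```

In a fuller pipeline, the pairs returned by `candidate_pairs` would be the samples passed to the classifier, with membership in `contacts` as the positive label.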
