基于智能计算的蛋白质相互作用预测方法研究

Research on Method of Prediction of Protein-Protein Interaction Using Intelligent Computing

【Author】 Du Xiuquan (杜秀全)

【Advisor】 Cheng Jiaxing (程家兴)

【Author information】 Anhui University, Computer Application Technology, 2010, PhD

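【Abstract (translated from the Chinese)】 With the successful completion of the Human Genome Project, scientists obtained a large amount of sequence information, and research moved from an era centered on the genome itself into the post-genomic era, which is marked by the study of functional genomics. In the post-genomic era, proteomics is one of the important branches of bioinformatics, because every physiological function in an organism depends on proteins and on the cooperation between proteins and other ligands. Proteins are among the main basic substances of living organisms. Protein-protein interactions play a key role in the function of cells and biological pathways, and understanding them greatly advances the study of the pathogenesis and treatment of many diseases. One of the most important challenges in proteomics is therefore to understand protein-protein interactions on a large scale at the physical and structural levels and to build the corresponding interaction networks. A common approach starts from the known primary sequences of proteins and their ligands, extracts the useful information contained in those sequences, and combines experimental and computational methods with this information to predict the likelihood of interaction between proteins and to build protein interaction networks. With advances in X-ray crystallography and nuclear magnetic resonance (NMR), the structures of many proteins have been determined, and these data have further encouraged the development of data-driven (computational) methods for predicting protein interactions.

This thesis applies algorithms from intelligent computing to several basic problems in protein interaction. The main topics are the prediction of protein interaction sites at the micro level, the prediction of protein interactions at the macro level, and the construction of protein interaction networks. For each of these three problems we carried out an in-depth analysis and proposed a corresponding prediction method, as detailed below.

1. Based on an analysis of the surface residues and interface residues in the interfaces formed by interacting proteins, we introduce the covering algorithm into protein interaction site prediction. The algorithm fits well with the observation that interface residues cluster both in spatial structure and in primary sequence. Interface-residue samples and surface-residue samples are first mapped (through a suitable transformation) onto a sphere in an n-dimensional space, and two basic protein sequence features are extracted: the sequence profile and the solvent-accessible surface area. Starting from an initial interface-residue sample, the covering algorithm takes that sample as the center and draws a circle whose radius is halfway between the distance to the nearest sample of the other class and the distance to the farthest sample of the same class, which yields a cover for samples of the same class. It then takes a sample of the other class as the center and constructs a cover in the same way, alternating between the classes. Based on the characteristics of the data, we constructed two datasets (a complete set and a balanced set) and ran this method and traditional machine learning algorithms (SVM, ME) on both. The results show that the algorithm is effective and feasible on both datasets. Finally, we locate the interaction sites on two complexes with the several algorithms, which further shows that the algorithm adapts well to unknown protein interaction sites and predicts them well.

2. A very large number of features can be used to predict protein interaction sites, and different researchers obtain different results with different feature combinations. Each feature contributes different information from a different angle; some features contribute nothing to the predictive power of the classifier and may even degrade the results. To address feature selection for protein interaction site prediction, we propose a new feature extraction algorithm that combines a genetic algorithm (GA) with a support vector machine (SVM). The algorithm uses the GA to extract the 68 relatively important features from a 110-dimensional protein sequence vector made up of the original basic features, and evaluates the extracted features with the SVM. The fitness of an individual is set to the F1-measure, a balanced value of the sensitivity and specificity of the classifier, which helps to find feature combinations that balance the classifier's performance measures. The experiments cover a random classifier, a two-stage classifier, SVM, and the GA/SVM classifier. The results show that protein interaction site prediction based on GA/SVM feature extraction is robust and outperforms both the original features and the other methods.

3. A key problem in protein interaction prediction is how to convert the sequence information of an interacting protein pair effectively, because different conversion methods express very different amounts of information and therefore give different classification performance. We propose a protein sequence conversion method based on amino acid order information (pseudo amino acid composition, PseAA). It takes into account not only the basic amino acid composition of the sequence but also the effects of short-range, medium-range, and long-range interactions between amino acids. An SVM is trained and used for classification on the new encoding. For comparison, the experiments also include three other conversion methods: correlation coefficient transformation (CC), auto covariance transformation (AC), and amino acid composition (AAC). The results show that the proposed encoding ranks second in classification performance among the four methods. The AC transformation produces the highest dimensionality, 840; CC is next with 420; AAC is the lowest with only 40; the proposed method has only 100. Weighing performance against cost, the proposed sequence conversion method for protein interactions is effective and feasible.

4. Protein interaction prediction studies the problem only at the level of individual proteins, whereas the functions of a living organism are tied to regulation by the interaction networks that proteins form inside the cell. We therefore took the classifier model obtained from the interaction prediction method above and extracted two types of interaction network data from the BioGrid database for testing. All protein sequences in the two interaction networks were converted into discrete vectors with the pseudo amino acid composition method and then classified with the model, and the predictions were drawn as protein interaction network maps. The results show that the method is also effective for constructing protein interaction networks.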
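The covering construction described in item 1 of the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names `project_to_sphere`, `build_covers`, and `predict` are hypothetical, the sphere mapping appends one extra coordinate so that every sample has norm `R`, distances are Euclidean, and the radius is taken as the mean of the distance to the nearest other-class sample and the distance to the farthest same-class sample that is still closer than that neighbor. It also assumes that no two samples with different labels are identical.

```python
import numpy as np

def project_to_sphere(X, R):
    """Lift n-dim samples onto a sphere of radius R by adding one coordinate."""
    X = np.asarray(X, dtype=float)
    norms_sq = np.sum(X ** 2, axis=1)
    extra = np.sqrt(np.maximum(R ** 2 - norms_sq, 0.0))
    return np.hstack([X, extra[:, None]])

def build_covers(X, y):
    """Build (center, radius, label) covers, alternating between classes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    uncovered = np.ones(len(X), dtype=bool)
    labels = list(np.unique(y))
    covers, turn = [], 0
    while uncovered.any():
        label = labels[turn % len(labels)]
        turn += 1
        candidates = np.flatnonzero(uncovered & (y == label))
        if candidates.size == 0:
            continue  # this class is fully covered; move to the next one
        center = X[candidates[0]]
        dist = np.linalg.norm(X - center, axis=1)
        other = dist[y != label]
        d_other = other.min() if other.size else np.inf
        # farthest same-class sample that is closer than the nearest other-class one
        inside = dist[(y == label) & (dist < d_other)]
        d_same = inside.max() if inside.size else 0.0
        radius = d_same if np.isinf(d_other) else (d_other + d_same) / 2.0
        covers.append((center, radius, label))
        uncovered &= ~((y == label) & (dist <= radius))
        uncovered[candidates[0]] = False  # the center is always covered
    return covers

def predict(covers, X):
    """Assign each sample the label of the cover whose boundary is nearest."""
    X = np.asarray(X, dtype=float)
    centers = np.array([c for c, _, _ in covers])
    radii = np.array([r for _, r, _ in covers])
    labels = np.array([lab for _, _, lab in covers])
    # distance to each cover boundary; negative values mean "inside the cover"
    gap = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) - radii
    return labels[np.argmin(gap, axis=1)]
```

In use, the feature vectors (for example, profile and accessibility values per residue) would be projected with a radius `R` at least as large as the largest training norm, covers would be built on the projected training set, and test samples would be projected with the same `R` before calling `predict`. Because each radius stays strictly below the distance to the nearest other-class sample, no cover contains a training sample of the other class.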

【Abstract】 With the successful completion of the Human Genome Project, scientists obtained a great deal of sequence information, and research moved from the era of genome-based research into the era of functional genomics, also known as the post-genomic era. Proteomics is an important branch of research in the post-genomic era, because the physiological functions carried out in vivo depend on proteins and on protein-ligand cooperation. Proteins are among the basic materials of life. Interactions between proteins play a key role in the functions of cells and biological pathways, and understanding these interactions also promotes the study of the pathogenesis and treatment of various diseases. One of the most important challenges in proteomics research is to understand protein-protein interactions on a large scale at the physical and structural levels and to build the corresponding protein-protein interaction networks. The common approach starts from the known primary structures of proteins and ligand sequences, extracts useful information from the sequences, and uses experimental or computational methods together with this information to predict the possibility of interaction between proteins and to establish protein interaction networks. With the progress of X-ray crystallography and nuclear magnetic resonance (NMR) technology, a large amount of protein structure data has been measured, and these data further promote the development of data-driven (computational) methods for predicting protein interactions.

In this thesis, we use computational intelligence algorithms to study some basic issues in protein-protein interaction. The main contents are the prediction of protein interaction sites at the micro level, the prediction of protein interactions at the macro level, and the construction of protein-protein interaction networks. For these three areas we conduct in-depth analysis and put forward corresponding prediction methods. The details are as follows.

1. Through an analysis of the surface residues and interface residues of protein-protein interfaces, we introduce the covering algorithm into protein-protein interaction site prediction. The algorithm works well with the clustering of interface residues that is observed in both the spatial structure and the primary sequence. First, interface residue samples and surface residue samples are placed on a sphere in an n-dimensional space (through some form of conversion), and two basic features of the protein sequence are extracted: the protein sequence profile and the solvent accessible surface area. We take one of the interface residue samples as the center and compute the minimum distance to samples of the other class and the maximum distance to samples of the same class. We then draw a circle whose radius is halfway between these two distances, constructing a cover for samples of the same class. Next, the center is moved to a sample of the other class and a cover is constructed in the same way, and the process alternates between the classes. According to the characteristics of the data, we constructed two experimental datasets (the Complete dataset and the Trim dataset) and ran our method and traditional machine learning algorithms (SVM, ME) on both. The experimental results show that the algorithm is effective and feasible on both datasets. Finally, we give two examples of locating protein interaction sites with the several algorithms, which further show that the algorithm has strong adaptability and predictive ability for unknown protein interaction sites.

2. Very many features are available for predicting protein interaction sites, and different researchers use different feature combinations and obtain different results. Because the various features provide different information from different angles, some of them contribute nothing to the predictive power of the classifier and may even degrade the prediction results. We therefore focus on feature selection for protein interaction site prediction and propose a new feature extraction algorithm based on the combination of a genetic algorithm (GA) and a support vector machine (SVM). The algorithm uses the GA to extract the 68 relatively important features from the original 110-dimensional vector of protein sequence features and evaluates the extracted features with the SVM. The fitness function of an individual is set to the F1-measure, a balanced value of sensitivity and specificity, which helps to find feature combinations that balance the classifier's performance measures. We designed experiments with a random classifier, a two-stage classifier, SVM, and the GA/SVM classifier. The experimental results show that the proposed GA/SVM feature extraction algorithm is robust and performs better than the other methods and the original features.

3. A key problem in protein-protein interaction prediction is how to convert protein sequence information effectively, because different conversion methods express different amounts of information and therefore result in different classification performance. We propose an amino acid order information method (pseudo amino acid composition, PseAA). This method takes into account not only the basic amino acid composition but also the short-range, medium-range, and long-range interactions of amino acids in the representation of the protein sequence. We use an SVM to learn and classify with the new sequence coding scheme. For performance comparison we also designed three other conversion methods: correlation coefficient (CC), auto covariance (AC), and amino acid composition (AAC). The experimental results show that the proposed coding scheme ranks second in classification performance. Among the four methods, AC produces the highest dimension, reaching 840; CC follows with 420; AAC is the lowest with only 40; the proposed method has a dimension of 100. Therefore, considering both performance and cost, the proposed protein-protein interaction sequence conversion method is effective and feasible.

4. Protein interaction prediction works only at the protein level, but the various functions of life are related to protein interaction networks. We therefore take the classification model obtained above and extract two types of interaction network data from the BioGrid database for testing. We convert the protein sequences of each interaction network into the corresponding discrete vectors with the pseudo amino acid composition method, use the classification model to predict them, and finally draw maps of the predicted protein interaction networks. The experimental results show that this method is also effective for the construction of protein interaction networks.
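The pseudo amino acid composition encoding mentioned in items 3 and 4 can be illustrated with a short sketch. Several choices here are assumptions, since the abstract does not state them: the function names `pse_aac` and `encode_pair` are hypothetical, a single property (the Kyte-Doolittle hydropathy index) stands in for whatever property set the thesis uses, the weight `w = 0.05` is a commonly used default, and a pair is encoded by concatenating the two per-protein vectors. Under that last assumption, the reported 100 dimensions would be consistent with 2 × (20 + λ) for λ = 30, just as the 40 dimensions of AAC would be 2 × 20, but the abstract itself gives no value for λ.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index, used as a single stand-in property (assumption).
KD = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
      "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
      "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3}

# Standardize the property over the 20 residues (zero mean, unit variance).
_vals = np.array([KD[a] for a in AMINO_ACIDS])
_NORM = dict(zip(AMINO_ACIDS, (_vals - _vals.mean()) / _vals.std()))

def pse_aac(seq, lam=30, w=0.05):
    """Return a (20 + lam)-dim vector: composition plus sequence-order terms.

    Assumes seq contains only the 20 standard residues and len(seq) > lam.
    """
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")
    # 20 amino acid frequencies
    freq = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float) / len(seq)
    # theta_k: mean squared property difference between residues k positions apart
    h = np.array([_NORM[a] for a in seq])
    theta = np.array([np.mean((h[:-k] - h[k:]) ** 2) for k in range(1, lam + 1)])
    denom = freq.sum() + w * theta.sum()
    return np.concatenate([freq, w * theta]) / denom

def encode_pair(seq_a, seq_b, lam=30, w=0.05):
    """Encode a protein pair by concatenating the two per-protein vectors."""
    return np.concatenate([pse_aac(seq_a, lam, w), pse_aac(seq_b, lam, w)])
```

Small values of `k` capture short-range order effects and larger values capture medium-range and long-range effects, which is how a fixed-length vector can carry sequence-order information that plain amino acid composition discards. The resulting pair vectors could then be passed to any SVM implementation for training and prediction.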

  • 【Online publication contributor】 Anhui University
  • 【Online publication year and issue】 2010, Issue 10